Abstract

Andresen and Spokoiny's (2013) "Critical dimension in semiparametric estimation" provides a technique for the finite sample analysis of profile M-estimators. This paper uses very similar ideas to derive two convergence results for the alternating procedure that approximates the maximizer of random functionals such as the realized log likelihood in maximum likelihood estimation. We show that the sequence attains the same deviation properties as shown for the profile M-estimator in Andresen and Spokoiny (2013), i.e. a finite sample Wilks and Fisher theorem.

Further, under slightly stronger smoothness constraints on the random functional we can show nearly linear convergence to the global maximizer if the starting point for the procedure is well chosen.
Convergence of an alternation procedure


1 Introduction 

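This paper presents a convergence result for an alternating maximization procedure to approximate M-estimators. Let $Y \in \mathcal{Y}$ denote some observed random data, and $\mathbb{P}$ denote the data distribution. In the semiparametric profile M-estimation framework the target of analysis is

$$\theta^* = \Pi_\theta \upsilon^* = \Pi_\theta \operatorname*{argmax}_{\upsilon} \mathbb{E}_{\mathbb{P}} \mathcal{L}(\upsilon, Y), \tag{1.1}$$

where $\mathcal{L}: \Upsilon \times \mathcal{Y} \to \mathbb{R}$, $\Pi_\theta: \Upsilon \to \mathbb{R}^p$ is a projection and where $\Upsilon$ is some high dimensional or even infinite dimensional parameter space. This paper focuses on finite dimensional parameter spaces $\Upsilon \subseteq \mathbb{R}^{p^*}$ with $p^* = p + m \in \mathbb{N}$ being the full dimension, as infinite dimensional maximization problems are computationally infeasible anyway. A prominent way of estimating $\theta^*$ is the profile M-estimator (pME)

$$\tilde\theta \stackrel{\mathrm{def}}{=} \Pi_\theta \tilde\upsilon \stackrel{\mathrm{def}}{=} \Pi_\theta \operatorname*{argmax}_{(\theta,\eta)} \mathcal{L}(\theta, \eta).$$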

The alternating maximization procedure is used in situations where a direct computation of the full maximum estimator (ME) $\tilde\upsilon \in \mathbb{R}^{p^*}$ is not feasible or simply very difficult to implement. Consider for example the task of calculating the pME with scalar random observations $Y = (y_i)_{i=1}^n \subset \mathbb{R}$, parameter $\upsilon = (\theta,\eta) \in \mathbb{R}^p \times \mathbb{R}^m$, a function basis $(e_k) \subset L^2(\mathbb{R})$ and

$$\mathcal{L}(\theta,\eta) = -\frac{1}{2}\sum_{i=1}^{n}\Big| y_i - \sum_{k=0}^{m} \eta_k e_k(X_i^\top\theta) \Big|^2 .$$

In this case the maximization problem is high dimensional and non-convex (see Section 3 for more details). But for fixed $\theta \in S_1 \subset \mathbb{R}^p$ maximization with respect to $\eta \in \mathbb{R}^m$ is rather simple, while for fixed $\eta \in \mathbb{R}^m$ the maximization with respect to $\theta \in \mathbb{R}^p$ can be feasible for low $p \in \mathbb{N}$. This motivates the following iterative procedure. Given some (data dependent) functional $\mathcal{L}: \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$ and an initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$ set for $k \in \mathbb{N}$


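$$\begin{aligned}
\tilde\upsilon_{k,k+1} &\stackrel{\mathrm{def}}{=} (\tilde\theta_k, \tilde\eta_{k+1}) = \Big(\tilde\theta_k,\ \operatorname*{argmax}_{\eta\in\mathbb{R}^m} \mathcal{L}(\tilde\theta_k, \eta)\Big), \\
\tilde\upsilon_{k,k} &\stackrel{\mathrm{def}}{=} (\tilde\theta_k, \tilde\eta_k) = \Big(\operatorname*{argmax}_{\theta\in\mathbb{R}^p} \mathcal{L}(\theta, \tilde\eta_k),\ \tilde\eta_k\Big).
\end{aligned} \tag{1.2}$$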
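The iteration (1.2) can be sketched in a few lines. The example below is only an illustration under assumptions that are not taken from the paper: the functional is the rank-one least squares objective $\mathcal{L}(\theta,\eta) = -\frac12\sum_{i,j}(Y_{ij} - \theta_i\eta_j)^2$, chosen because it is not jointly concave while each partial maximization has a closed form. The names `objective`, `argmax_eta`, `argmax_theta` and `alternate` are hypothetical helpers.

```python
import random


def objective(Y, theta, eta):
    """L(theta, eta) = -1/2 * sum_ij (Y[i][j] - theta[i] * eta[j])**2."""
    return -0.5 * sum(
        (Y[i][j] - theta[i] * eta[j]) ** 2
        for i in range(len(theta))
        for j in range(len(eta))
    )


def argmax_eta(Y, theta):
    """Exact maximizer over eta for fixed theta (least squares per column)."""
    norm2 = sum(t * t for t in theta)
    return [sum(Y[i][j] * theta[i] for i in range(len(theta))) / norm2
            for j in range(len(Y[0]))]


def argmax_theta(Y, eta):
    """Exact maximizer over theta for fixed eta (least squares per row)."""
    norm2 = sum(e * e for e in eta)
    return [sum(Y[i][j] * eta[j] for j in range(len(eta))) / norm2
            for i in range(len(Y))]


def alternate(Y, theta0, eta0, steps):
    """Run (1.2): update eta for fixed theta, then theta for fixed eta."""
    theta, eta = list(theta0), list(eta0)
    values = [objective(Y, theta, eta)]
    for _ in range(steps):
        eta = argmax_eta(Y, theta)      # the point (theta_k, eta_{k+1})
        values.append(objective(Y, theta, eta))
        theta = argmax_theta(Y, eta)    # the point (theta_{k+1}, eta_{k+1})
        values.append(objective(Y, theta, eta))
    return theta, eta, values


# Synthetic data (assumption): rank-one signal plus small Gaussian noise.
random.seed(0)
theta_star, eta_star = [1.0, -2.0, 0.5], [0.5, 1.5]
Y = [[t * e + 0.1 * random.gauss(0.0, 1.0) for e in eta_star]
     for t in theta_star]
theta_k, eta_k, values = alternate(Y, [1.0, 1.0, 1.0], [1.0, 1.0], steps=20)
```

Because every half step is an exact partial maximization, the recorded objective values never decrease. That monotonicity says nothing by itself about which point the sequence approaches, which is the question the paper addresses.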


The so called "alternating maximization procedure" (or minimization) is a widely applied algorithm in many parameter estimation tasks (see Jain et al. (2013), Netrapalli et al. (2013), Keshavan et al. (2010) or Yi et al. (2013)). Some natural questions arise: Does the sequence $(\tilde\theta_k)$ converge to a limit that satisfies the same statistical properties as the profile estimator? And if the answer is yes, after how many steps does the sequence acquire these properties? Under what circumstances does the sequence actually converge to the global maximizer $\tilde\upsilon$? This problem is hard because the behavior of each step of the sequence is determined by the actual finite sample realization of the functional $\mathcal{L}(\cdot, Y)$. To the authors' knowledge no general "convergence" result is available that answers the questions from above, except for the treatment of specific models (see again Jain et al. (2013), Netrapalli et al. (2013), Keshavan et al. (2010) or Yi et al. (2013)).

We address this difficulty by employing the new finite sample techniques of Andresen and Spokoiny (2013) and Spokoiny (2012), which allow us to answer the above questions: with growing iteration number $k \in \mathbb{N}$ the estimators $\tilde\theta_k$ attain the same statistical properties as the profile M-estimator, and Theorem 2.2 provides a choice of the necessary number of steps $K \in \mathbb{N}$. Under slightly stronger conditions on the structure of the model we can give a convergence result to the global maximizer that does not rely on unimodality. Further we can address the important question under which ratio of full dimension $p^* = p + m \in \mathbb{N}$ to sample size $n \in \mathbb{N}$ the sequence behaves as desired. For instance for smooth $\mathcal{L}$ our results become sharp if $p^*/\sqrt{n}$ is small, and convergence to the full maximizer already occurs if $p^*/n$ is small.

The alternating maximization procedure can be understood as a special case of the Expectation Maximization algorithm (EM algorithm), as we will illustrate below. The EM algorithm itself was derived by Dempster et al. (1977), who generalized particular versions of this approach and presented a variety of problems where its application can be fruitful; for a brief history of the EM algorithm see McLachlan and Krishnan (1997) (Sect. 1.8). We briefly explain the EM algorithm. Take observations $(X) \sim \mathbb{P}_\theta$ for some parametric family $(\mathbb{P}_\theta, \theta \in \Theta)$. Assume that a parameter $\theta \in \Theta$ is to be estimated as maximizer of the functional $\mathcal{L}_c(X, \theta) \in \mathbb{R}$, but that only $Y \in \mathcal{Y}$ is observed, where $Y = f_Y(X)$ is the image of the complete data set $X \in \mathcal{X}$ under some map $f_Y: \mathcal{X} \to \mathcal{Y}$. Prominent examples for the map $f_Y$ are projections onto some components of $X$ if both are vectors. The information lost under the map can be regarded as missing data or latent variables. As a direct maximization of the functional is impossible without knowledge of $X$, the EM algorithm serves as a workaround. It consists of the iteration of two steps: starting with some initial guess $\theta_0$, the $k$th "Expectation step" derives the functional $Q$ via

$$Q(\theta, \theta_k) = \mathbb{E}_{\theta_k}[\mathcal{L}_c(X, \theta) \mid Y],$$

which means that on the right hand side the conditional expectation is calculated under the distribution $\mathbb{P}_{\theta_k}$. The $k$th "Maximization step" then simply locates the maximizer $\theta_{k+1}$ of $Q$.

Since the algorithm is very popular in applications, a lot of research on its behaviour has been done. We are only dealing with a special case of this procedure, so we restrict ourselves to citing the well known convergence result by Wu (1983). Wu presents regularity conditions that ensure that $\mathcal{L}(\theta_{k+1}) \ge \mathcal{L}(\theta_k)$ where

$$\mathcal{L}(\theta, Y) \stackrel{\mathrm{def}}{=} \log \int_{\{X \mid Y = f_Y(X)\}} \exp\{\mathcal{L}_c(X, \theta)\}\, dX,$$

such that $\mathcal{L}(\theta_k) \to \mathcal{L}^*$ for some limit value $\mathcal{L}^* > 0$ that may depend on the starting point $\theta_0$. Additionally Wu gives conditions that guarantee that the sequence $\theta_k$ (possibly a sequence of sets) converges to $C(\mathcal{L}^*) \stackrel{\mathrm{def}}{=} \{\theta \mid \mathcal{L}(\theta) = \mathcal{L}^*\}$. Dempster et al. (1977) show that the speed of convergence is linear in the case of point valued $\theta_k$ and of some differentiability criterion being met. A limitation of these results is that it is not clear whether $\mathcal{L}^* = \sup \mathcal{L}(\theta)$, and thus it is not guaranteed that $C(\mathcal{L}^*)$ is the desired MLE and not just some local maximum. Of course this problem disappears if $\mathcal{L}(\cdot)$ is unimodal and the regularity conditions are met, but this assumption may be too restrictive.
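To make the two EM steps and the ascent property $\mathcal{L}(\theta_{k+1}) \ge \mathcal{L}(\theta_k)$ concrete, here is a minimal sketch for a model that is not taken from the paper: an equal-weight mixture of two unit-variance Gaussians with unknown means, where the missing data are the component labels. The functions `log_likelihood` and `em_step` and the synthetic sample are assumptions made for illustration.

```python
import math
import random


def log_likelihood(ys, mu1, mu2):
    """Observed-data log likelihood of 0.5*N(mu1, 1) + 0.5*N(mu2, 1)."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return sum(
        math.log(0.5 * c * math.exp(-0.5 * (y - mu1) ** 2)
                 + 0.5 * c * math.exp(-0.5 * (y - mu2) ** 2))
        for y in ys
    )


def em_step(ys, mu1, mu2):
    """One Expectation step followed by one Maximization step."""
    # E-step: posterior probability that y belongs to component 1,
    # computed under the current parameter (mu1, mu2).
    resp = []
    for y in ys:
        w1 = math.exp(-0.5 * (y - mu1) ** 2)
        w2 = math.exp(-0.5 * (y - mu2) ** 2)
        resp.append(w1 / (w1 + w2))
    # M-step: the maximizer of Q(., theta_k) is a pair of weighted means.
    s1 = sum(resp)
    s2 = len(ys) - s1
    new_mu1 = sum(r * y for r, y in zip(resp, ys)) / s1
    new_mu2 = sum((1.0 - r) * y for r, y in zip(resp, ys)) / s2
    return new_mu1, new_mu2


# Synthetic sample (assumption): true means -2 and 2.
random.seed(1)
ys = [random.gauss(-2.0 if random.random() < 0.5 else 2.0, 1.0)
      for _ in range(200)]

mu1, mu2 = -0.5, 0.5
log_liks = [log_likelihood(ys, mu1, mu2)]
for _ in range(30):
    mu1, mu2 = em_step(ys, mu1, mu2)
    log_liks.append(log_likelihood(ys, mu1, mu2))
```

The list `log_liks` is nondecreasing, which is the monotonicity discussed above. As the text points out, this alone does not identify the limit as the global maximizer.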

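In a recent work Balakrishnan et al. (2014) present a new way of addressing the properties of the EM sequence in a very general i.i.d. setting, based on concavity of $\theta \mapsto \mathbb{E}_{\theta^*}[\mathcal{L}_c(X, \theta)]$. They show that if, in addition to concavity, the functional $\mathcal{L}_c$ is smooth enough (first order stability) and if for a sample $(Y_i)$ a uniform bound of the kind

$$\sup_{\theta \in B_r(\theta^*)} \Big\| \operatorname*{argmax}_{\theta^\circ} \sum_{i=1}^{n} \mathbb{E}_\theta[\mathcal{L}_c(X, \theta^\circ) \mid Y_i] - \operatorname*{argmax}_{\theta^\circ} \mathbb{E}_{\theta^*}\big[\mathbb{E}_\theta[\mathcal{L}_c(X, \theta^\circ) \mid Y]\big] \Big\| \le \epsilon_n \tag{1.3}$$

holds with high probability, then with high probability and some $\rho < 1$

$$\|\theta_k - \theta^*\| \le \rho^k \|\theta_0 - \theta^*\| + C\epsilon_n. \tag{1.4}$$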


Unfortunately this does not answer our two questions to full satisfaction. First, the bound (1.3) is rather high level and has to be checked for each model, while we seek (and find) properties of the functional, such as smoothness and bounds on the moments of its gradient, that lead to comparably desirable behavior. Further, with (1.4) it remains unclear whether for large $k \in \mathbb{N}$ the alternating sequence satisfies a Fisher expansion or whether a Wilks type phenomenon occurs. In particular it remains open which ratio of dimension to sample size ensures good performance of the procedure. Also the actual convergence $\theta_k \to \theta^*$ is not implied, as the right hand side in (1.4) is bounded from below by $C\epsilon_n > 0$.
Remark 1.1. In the context of the alternating procedure the bound (1.3) would read

$$\max_{\theta^\circ \in B_r(\theta^*)} \Big\| \operatorname*{argmax}_{\theta} \mathcal{L}(\theta, \tilde\eta_{\theta^\circ}) - \operatorname*{argmax}_{\theta} \mathbb{E}\mathcal{L}(\theta, \tilde\eta_{\theta^\circ}) \Big\| \le \epsilon_n,$$

which is still difficult to check.


To see that the procedure (1.2) is a special case of the EM algorithm denote, in the notation from above, $X = \big(\operatorname{argmax}_\eta \mathcal{L}\{(\theta, \eta), Y\}, Y\big)$, where $\theta$ is the parameter specifying the distribution $\mathbb{P}_\theta$, and $f_Y(X) = Y$. Then with $\mathcal{L}_c(\theta, X) = \mathcal{L}_c(\theta, \eta, Y) \stackrel{\mathrm{def}}{=} \mathcal{L}(\theta, \eta)$

$$Q(\theta, \theta_{k-1}) = \mathbb{E}_{\theta_{k-1}}[\mathcal{L}_c(\theta, X) \mid Y] = \mathcal{L}_c\big(\theta, \operatorname{argmax}_\eta \mathcal{L}\{(\theta_{k-1}, \eta), Y\}, Y\big) = \mathcal{L}(\theta, \tilde\eta_k),$$

and thus the resulting sequence is the same as in (1.2). Consequently the convergence results from above apply to our problem if the involved regularity criteria are met. But as noted, these results do not tell us whether the limit of the sequence $(\tilde\theta_k)$ actually is the profile estimator, and the statistical properties of limit points are not clear without too restrictive assumptions on $\mathcal{L}$ and the data.

This work fills this gap for a wide range of settings. Our main result can be summarized as follows: Under a set of regularity conditions on the data and the functional $\mathcal{L}$, points of the sequence $(\tilde\theta_k)$ behave for large iteration number $k \in \mathbb{N}$ like the pME. To be more precise, we show in Theorem 2.2 that when the initial guess $\tilde\upsilon_0 \in \Upsilon$ is good enough, then the step estimator sequence $(\tilde\theta_k)$ satisfies with high probability

$$\big\|\breve{D}(\tilde\theta_k - \theta^*) - \breve\xi\big\|^2 \le \epsilon(p^* + \rho^k R_0),$$

$$\Big|\max_\eta \mathcal{L}(\tilde\theta_k, \eta) - \max_\eta \mathcal{L}(\theta^*, \eta) - \|\breve\xi\|^2/2\Big| \le (p + x)^{1/2}\epsilon(p^* + \rho^k R_0),$$

where $\rho < 1$ and $\epsilon > 0$ is some small number, for example $\epsilon = C p^*/\sqrt{n}$ in the smooth i.i.d. setting. Further $R_0 > 0$ is a bound related to the quality of the initial guess. The random variable $\breve\xi \in \mathbb{R}^p$ and the matrix $\breve{D} \in \mathbb{R}^{p \times p}$ are related to the efficient influence function in semiparametric models and its covariance. These are, up to $\rho^k R_0$, the same properties as those proven for the pME in Andresen and Spokoiny (2013) under nearly the same set of conditions. Further, in our second main result we show under slightly stronger smoothness conditions that $(\tilde\theta_k, \tilde\eta_k)$ approaches the ME $\tilde\upsilon$ with nearly linear convergence speed, i.e. $\|\mathcal{D}((\tilde\theta_k, \tilde\eta_k) - \tilde\upsilon)\| \le \tau^{k/\log(k)}$ with some $0 < \tau < 1$ and $\mathcal{D}^2 = -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ (see Theorem 2.4).

In the following we write $\tilde\upsilon_{k,k(+1)}$ in statements that are true for both $\tilde\upsilon_{k,k+1}$ and $\tilde\upsilon_{k,k}$. Also we do not specify whether the elements of the resulting sequence are sets or single points. All statements made about properties of $\tilde\upsilon_{k,k(+1)}$ are to be understood in the sense that they hold for "every point of $\tilde\upsilon_{k,k(+1)}$".
1.1 Idea of the proof

To motivate the approach first consider the toy model 

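$$Y = \upsilon^* + \varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(0, F_{\upsilon^*}^{-2}), \qquad F_{\upsilon^*}^2 =: \begin{pmatrix} F_{\theta^*}^2 & A \\ A^\top & F_{\eta^*}^2 \end{pmatrix}.$$

In this case we set $\mathcal{L}$ to be the true log likelihood of the observations

$$\mathcal{L}(\upsilon, Y) = -\|F_{\upsilon^*}(\upsilon - Y)\|^2/2.$$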

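With any initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$ we obtain from (1.2) for $k \in \mathbb{N}$ and the usual first order criterion of maximality the following two equations

$$F_{\theta^*}(\tilde\theta_k - \theta^*) = F_{\theta^*}\varepsilon_\theta - F_{\theta^*}^{-1}A(\tilde\eta_k - \eta^*),$$

$$F_{\eta^*}(\tilde\eta_{k+1} - \eta^*) = F_{\eta^*}\varepsilon_\eta - F_{\eta^*}^{-1}A^\top(\tilde\theta_k - \theta^*).$$

Combining these two equations we derive, assuming $\|F_{\theta^*}^{-1}AF_{\eta^*}^{-2}A^\top F_{\theta^*}^{-1}\| =: \|M_0\| = \nu < 1$,

$$\begin{aligned}
F_{\theta^*}(\tilde\theta_k - \theta^*) &= F_{\theta^*}^{-1}(F_{\theta^*}^2\varepsilon_\theta - A\varepsilon_\eta) + F_{\theta^*}^{-1}AF_{\eta^*}^{-2}A^\top F_{\theta^*}^{-1}F_{\theta^*}(\tilde\theta_{k-1} - \theta^*) \\
&= \sum_{l=1}^{k} M_0^{k-l}F_{\theta^*}^{-1}(F_{\theta^*}^2\varepsilon_\theta - A\varepsilon_\eta) + M_0^k F_{\theta^*}(\tilde\theta_0 - \theta^*) \to F_{\theta^*}(\hat\theta - \theta^*).
\end{aligned}$$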

Because the limit $\hat\theta$ is independent of the initial point $\tilde\upsilon_0$ and because the profile estimator $\tilde\theta$ is a fixed point of the procedure, the unique limit satisfies $\hat\theta = \tilde\theta$. This argument is based on the fact that in this setting the functional is quadratic such that the gradient satisfies

$$\nabla\mathcal{L}(\upsilon) = F_{\upsilon^*}^2(\upsilon^* - \upsilon) + F_{\upsilon^*}^2\varepsilon.$$

Any smooth function is approximately quadratic around its maximizer, which motivates a local linear approximation of the gradient of the functional $\mathcal{L}$ to derive our results with similar arguments. This is done in the proof of Theorem 2.2.
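The contraction in the toy model is easy to observe numerically. The sketch below assumes scalar blocks ($p = m = 1$) with $F_{\theta^*}^2 = a$, $F_{\eta^*}^2 = b$ and coupling $A = c$, so that $M_0 = c^2/(ab)$; the function name `alternate_quadratic` and all numbers are illustrative assumptions, not values from the paper.

```python
def alternate_quadratic(y_theta, y_eta, a, b, c, theta0, steps):
    """Alternating maximization of the concave quadratic
        L(theta, eta) = -1/2 * (a*dt**2 + 2*c*dt*de + b*de**2),
    where dt = theta - y_theta, de = eta - y_eta and a*b > c**2.
    Returns the list of theta iterates, starting with theta0.
    """
    theta = theta0
    path = [theta]
    for _ in range(steps):
        # dL/d(eta) = 0 for fixed theta
        eta = y_eta - (c / b) * (theta - y_theta)
        # dL/d(theta) = 0 for fixed eta
        theta = y_theta - (c / a) * (eta - y_eta)
        path.append(theta)
    return path


a, b, c = 2.0, 1.0, 1.0       # blocks of the information matrix
rate = c * c / (a * b)        # scalar version of M_0, here 0.5
y_theta, y_eta = 0.3, -1.2    # the full maximizer is Y = (y_theta, y_eta)
path = alternate_quadratic(y_theta, y_eta, a, b, c, theta0=5.0, steps=10)
errors = [t - y_theta for t in path]
```

In this scalar case the error with respect to the full maximizer is multiplied by exactly `rate` in every full step, so the iterates converge linearly to the $\theta$-component of the maximizer, whatever the starting point.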

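First it is ensured that the whole sequence $(\tilde\upsilon_{k,k(+1)})_{k \in \mathbb{N}_0}$ satisfies for some $R_0 > 0$

$$\{\tilde\upsilon_{k,k(+1)},\ k \in \mathbb{N}_0\} \subset \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le R_0\}, \tag{1.5}$$

where $\mathcal{D}^2 \stackrel{\mathrm{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ (see Theorem 4.3). In the second step we approximate with $\zeta = \mathcal{L} - \mathbb{E}\mathcal{L}$

$$\mathcal{L}(\upsilon, \upsilon^*) = \nabla\zeta(\upsilon^*)(\upsilon - \upsilon^*) - \|\mathcal{D}(\upsilon - \upsilon^*)\|^2/2 + \alpha(\upsilon, \upsilon^*), \tag{1.6}$$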
where $\alpha(\upsilon, \upsilon^*)$ is defined by (1.6). Similar to the toy case above this allows using the first order criterion of maximality and (1.5) to obtain a bound of the kind

$$\begin{aligned}
\|\mathcal{D}(\tilde\upsilon_{k,k} - \upsilon^*)\| &\le C\sum_{l=0}^{k}\rho^l\big(\|\mathcal{D}^{-1}\nabla\zeta(\upsilon^*)\| + |\alpha(\tilde\upsilon_{l,l}, \upsilon^*)|\big) \\
&\le C_1\big(\|\mathcal{D}^{-1}\nabla\zeta(\upsilon^*)\| + \epsilon(R_0)\big) + \rho^k R_0 \stackrel{\mathrm{def}}{=} r_k.
\end{aligned}$$

This is done in Lemma 4.5 using results from Andresen and Spokoiny (2013) to show that $\epsilon(R_0)$ is small. Finally the same arguments as in Andresen and Spokoiny (2013) allow us to obtain our main result using that with high probability for all $k \in \mathbb{N}_0$, $\tilde\upsilon_{k,k} \in \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le r_k\}$. For the convergence result similar arguments are used. The only difference is that instead of (1.6) we use the approximation

$$\mathcal{L}(\upsilon, \tilde\upsilon) = -\|\mathcal{D}(\upsilon - \tilde\upsilon)\|^2/2 + \alpha'(\upsilon, \tilde\upsilon),$$

exploiting that $\nabla\mathcal{L}(\tilde\upsilon) = 0$, which allows us to obtain actual convergence to the ME.

It is worth pointing out two technical challenges of the analysis. First, the sketched approach relies on (1.5). As all estimators $(\tilde\upsilon_{k,k(+1)})$ are random, this means that we need with some small $\beta > 0$

$$\mathbb{P}\Big(\bigcap_{k \in \mathbb{N}_0}\big\{\tilde\upsilon_{k,k(+1)} \in \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le R_0\}\big\}\Big) \ge 1 - \beta.$$

This is not trivial, but Theorem 4.3 delivers the result thanks to $\mathcal{L}(\tilde\upsilon_{k,k(+1)}) \ge \mathcal{L}(\tilde\upsilon_0)$. Second, the main result, Theorem 2.2, is formulated to hold for all $k \in \mathbb{N}_0$. This implies the need of a bound of the kind

$$\mathbb{P}\Big(\bigcap_{k \in \mathbb{N}_0}\big\{\|\mathcal{D}^{-1}\{\nabla\zeta(\tilde\upsilon_{k,k}) - \nabla\zeta(\upsilon^*)\}\| \le \epsilon(r_k)\big\}\Big) \ge 1 - \beta,$$

with some small $\epsilon(r) > 0$ that is decreasing if $r > 0$ shrinks. Again this is not trivial and not a direct implication of the results of Andresen and Spokoiny (2013) or Spokoiny (2012). We derive this result in the desired way in Theorem 8.2, which is an adapted version of Theorem D.1 of Andresen and Spokoiny (2013) based on Corollary 2.5 of Spokoiny (2012).
2 Main results


2.1 Conditions 

This section collects the conditions imposed on the model. We use the same set of 
assumptions as in Andresen and Spokoiny (2013) and this section closely follows Section 
2.1 of that paper. 

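Let the full dimension of the problem be finite, i.e. $p^* < \infty$. Our conditions involve the symmetric positive definite information matrix $\mathcal{D}^2 \in \mathbb{R}^{p^* \times p^*}$ and a central point $\upsilon^\circ \in \mathbb{R}^{p^*}$. In typical situations for $p^* < \infty$, one can set $\upsilon^\circ = \upsilon^*$ where $\upsilon^*$ is the "true point" from (1.1). The matrix $\mathcal{D}^2$ can be defined as follows:

$$\mathcal{D}^2 = -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^\circ).$$

Here and in what follows we implicitly assume that the log-functional $\mathcal{L}(\upsilon): \mathbb{R}^{p^*} \to \mathbb{R}$ is sufficiently smooth in $\upsilon \in \mathbb{R}^{p^*}$; $\nabla\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*}$ stands for the gradient and $\nabla^2\mathbb{E}\mathcal{L}(\upsilon) \in \mathbb{R}^{p^* \times p^*}$ for the Hessian of the expectation $\mathbb{E}\mathcal{L}: \mathbb{R}^{p^*} \to \mathbb{R}$ at $\upsilon \in \mathbb{R}^{p^*}$. By smooth enough we mean that we can interchange $\nabla\mathbb{E}\mathcal{L} = \mathbb{E}\nabla\mathcal{L}$ on $\Upsilon_\circ(R_0)$, where $\Upsilon_\circ(r)$ is defined in (2.1) and $R_0 > 0$ in (2.4). It is worth mentioning that $\mathcal{D}^2 = \mathcal{V}^2 \stackrel{\mathrm{def}}{=} \operatorname{Cov}(\nabla\mathcal{L}(\upsilon^*))$ if the model $Y \sim \mathbb{P}_{\upsilon^*} \in (\mathbb{P}_\upsilon)$ is correctly specified and sufficiently regular; see e.g. Ibragimov and Khas'minskij (1981).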

In the context of semiparametric estimation, it is convenient to represent the information matrix in block form:

$$\mathcal{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix}.$$

First we state an identifiability condition. 


$(\mathcal{I})$ It holds for some $\rho < 1$

$$\|H^{-1}A^\top D^{-1}\|_\infty \le \sqrt{\rho}.$$

Remark 2.1. The condition $(\mathcal{I})$ allows us to introduce the important $p \times p$ efficient information matrix $\breve{D}^2$, which is defined as the inverse of the $\theta$-block of the inverse of the full dimensional matrix $\mathcal{D}^2$. The exact formula is given by

$$\breve{D}^2 \stackrel{\mathrm{def}}{=} D^2 - AH^{-2}A^\top,$$

and $(\mathcal{I})$ ensures that the matrix $\breve{D}^2$ is well posed.
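The statement of Remark 2.1 can be checked with the standard block inversion (Schur complement) formula. The short derivation below is a sketch; it assumes that the identifiability condition is used in the form $\|D^{-1}AH^{-1}\|^2 \le \rho < 1$, as in the definition of $\rho$ in Section 2.2.

```latex
% theta-block of the inverse of the block matrix D^2 (Schur complement):
\mathcal{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix}
\quad\Longrightarrow\quad
\bigl(\mathcal{D}^{-2}\bigr)_{\theta\theta}
  = \bigl(D^2 - A H^{-2} A^\top\bigr)^{-1} = \breve{D}^{-2}.

% Well-posedness: factor out D on both sides and use the norm bound.
\breve{D}^2
  = D\,\bigl(I_p - (D^{-1} A H^{-1})(D^{-1} A H^{-1})^\top\bigr)\,D
  \;\succeq\; (1 - \rho)\, D^2 \;\succ\; 0 .
```

In particular $\breve{D}^2$ is invertible and its smallest eigenvalue degrades only by the factor $1 - \rho$ relative to $D^2$.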
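Using the matrix $\mathcal{D}^2$ and the central point $\upsilon^\circ \in \mathbb{R}^{p^*}$, we define the local set $\Upsilon_\circ(r) \subset \Upsilon \subseteq \mathbb{R}^{p^*}$ with some $r \ge 0$:

$$\Upsilon_\circ(r) \stackrel{\mathrm{def}}{=} \{\upsilon = (\theta, \eta) \in \Upsilon:\ \|\mathcal{D}(\upsilon - \upsilon^\circ)\| \le r\}. \tag{2.1}$$

The following two conditions quantify the smoothness properties on $\Upsilon_\circ(r)$ of the expected log-functional $\mathbb{E}\mathcal{L}(\upsilon)$ and of the stochastic component $\zeta(\upsilon) = \mathcal{L}(\upsilon) - \mathbb{E}\mathcal{L}(\upsilon)$.

$(\breve{\mathcal{L}}_0)$ For each $r \le r_0$, there is a constant $\breve\delta(r)$ such that it holds on the set $\Upsilon_\circ(r)$:

$$\|D^{-1}D^2(\upsilon)D^{-1} - I_p\| \le \breve\delta(r), \qquad \|D^{-1}(A(\upsilon) - A)H^{-1}\| \le \breve\delta(r),$$

$$\|D^{-1}AH^{-1}(I_m - H^{-1}H^2(\upsilon)H^{-1})\| \le \breve\delta(r).$$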


Remark 2.2. This condition describes the local smoothness properties of the function $\mathbb{E}\mathcal{L}(\upsilon)$. In particular, it allows us to bound the error of local linear approximation of the projected gradient $\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon)$, which is defined as

$$\breve\nabla_\theta = \nabla_\theta - AH^{-2}\nabla_\eta.$$

Under condition $(\breve{\mathcal{L}}_0)$ it follows from the second order Taylor expansion for any $\upsilon, \upsilon' \in \Upsilon_\circ(r)$ (see Lemma B.1 of Andresen and Spokoiny (2013))

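$$\big\|\breve{D}^{-1}\big(\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon) - \breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon^*)\big) + \breve{D}(\theta - \theta^*)\big\| \le \breve\delta(r)\,r. \tag{2.2}$$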


In the proofs we actually only need the condition (2.2), which in some cases can be weaker than $(\breve{\mathcal{L}}_0)$.

The next condition concerns the regularity of the stochastic component $\zeta(\upsilon) = \mathcal{L}(\upsilon) - \mathbb{E}\mathcal{L}(\upsilon)$. Similarly to Spokoiny (2012), we implicitly assume that the stochastic component $\zeta(\upsilon)$ is a separable stochastic process.


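$(\breve{\mathcal{ED}}_1)$ For all $0 < r < r_0$, there exists a constant $\breve\omega \le 1/2$ such that for all $|\mu| \le \breve{g}$ and $\upsilon, \upsilon' \in \Upsilon_\circ(r)$

$$\sup_{\upsilon, \upsilon' \in \Upsilon_\circ(r)}\ \sup_{\|\gamma\| \le 1}\ \log\mathbb{E}\exp\Big\{\mu\,\frac{\gamma^\top\breve{D}^{-1}\{\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\}}{\breve\omega\,\|\mathcal{D}(\upsilon - \upsilon')\|}\Big\} \le \frac{\breve\nu_1^2\mu^2}{2}.$$

The above conditions allow us to derive the main result once the accuracy of the sequence is established. We include another condition that allows us to control the deviation behavior of $\|\breve{D}^{-1}\breve\nabla\zeta(\upsilon^*)\|$. To present this condition define the covariance matrices $\mathcal{V}^2 \in \mathbb{R}^{p^* \times p^*}$ and $\breve{V}^2 \in \mathbb{R}^{p \times p}$

$$\mathcal{V}^2 \stackrel{\mathrm{def}}{=} \operatorname{Var}\{\nabla\mathcal{L}(\upsilon^\circ)\}, \qquad \breve{V}^2 = \operatorname{Cov}(\breve\nabla_\theta\zeta(\upsilon^\circ)).$$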
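$\mathcal{V}^2 \in \mathbb{R}^{p^* \times p^*}$ describes the variability of the process $\mathcal{L}(\upsilon)$ around the central point $\upsilon^\circ$.

$(\breve{\mathcal{ED}}_0)$ There exist constants $\breve\nu_0 > 0$ and $\breve{g} > 0$ such that for all $|\mu| \le \breve{g}$

$$\sup_{\gamma \in \mathbb{R}^p}\ \log\mathbb{E}\exp\Big\{\mu\,\frac{\langle\breve\nabla_\theta\zeta(\upsilon^\circ), \gamma\rangle}{\|\breve{V}\gamma\|}\Big\} \le \frac{\breve\nu_0^2\mu^2}{2}.$$

So far we only presented conditions that allow us to treat the properties of $\tilde\theta_k$ on local sets $\Upsilon_\circ(r_k)$. To show that $r_k$ is not too large the following, stronger conditions are employed: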

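$(\mathcal{L}_0)$ For each $r \le r_0$, there is a constant $\delta(r)$ such that it holds on the set $\Upsilon_\circ(r)$:

$$\|\mathcal{D}^{-1}\{-\nabla^2\mathbb{E}\mathcal{L}(\upsilon)\}\mathcal{D}^{-1} - I_{p^*}\| \le \delta(r).$$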


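$(\mathcal{ED}_1)$ There exists a constant $\omega \le 1/2$, such that for all $|\mu| \le g$ and all $0 < r < r_0$

$$\sup_{\upsilon, \upsilon' \in \Upsilon_\circ(r)}\ \sup_{\|\gamma\| = 1}\ \log\mathbb{E}\exp\Big\{\mu\,\frac{\gamma^\top\mathcal{D}^{-1}\{\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon')\}}{\omega\,\|\mathcal{D}(\upsilon - \upsilon')\|}\Big\} \le \frac{\nu_1^2\mu^2}{2}.$$

$(\mathcal{ED}_0)$ There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$

$$\sup_{\gamma \in \mathbb{R}^{p^*}}\ \log\mathbb{E}\exp\Big\{\mu\,\frac{\langle\nabla\zeta(\upsilon^\circ), \gamma\rangle}{\|\mathcal{V}\gamma\|}\Big\} \le \frac{\nu_0^2\mu^2}{2}.$$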


It is important to note that the constants $\breve\omega, \breve\delta(r), \breve\nu$ and $\omega, \delta(r), \nu$ in the respective weak and strong versions can differ substantially and may depend on the full dimension $p^* \in \mathbb{N}$ in less or more severe ways ($AH^{-2}\nabla_\eta\mathcal{L}$ might be quite smooth while $\nabla_\eta\mathcal{L}$ could be less regular). This is why we use both sets of conditions where they suit best, although the list of assumptions becomes rather long. If a short list is preferred, the following lemma shows that the stronger conditions imply the weaker ones from above:


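Lemma 2.1. [Andresen and Spokoiny (2013), Lemma 2.1] Assume $(\mathcal{I})$. Then $(\mathcal{ED}_1)$ implies $(\breve{\mathcal{ED}}_1)$, $(\mathcal{L}_0)$ implies $(\breve{\mathcal{L}}_0)$, and $(\mathcal{ED}_0)$ implies $(\breve{\mathcal{ED}}_0)$ with

$$\breve{g} = \frac{\sqrt{1 - \rho^2}}{1 + \rho\sqrt{1 + \rho^2}}\,g, \qquad \breve\nu = \frac{1 + \rho\sqrt{1 + \rho^2}}{\sqrt{1 - \rho^2}}\,\nu, \qquad \breve\delta(r) = \delta(r), \qquad \text{and}\quad \breve\omega = \omega.$$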


Finally we present two conditions that ensure that with high probability the sequence $(\tilde\upsilon_{k,k(+1)})$ stays close to $\upsilon^*$ if the initial guess $\tilde\upsilon_0$ lands close to $\upsilon^*$. These conditions have to be satisfied on the whole set $\Upsilon \subseteq \mathbb{R}^{p^*}$.
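$(\mathcal{L}_r)$ For any $r > r_0$ there exists a value $b(r) > 0$, such that

$$\frac{-\mathbb{E}\mathcal{L}(\upsilon, \upsilon^\circ)}{\|\mathcal{D}(\upsilon - \upsilon^\circ)\|^2} \ge b(r), \qquad \upsilon \in \Upsilon_\circ(r).$$

$(\mathcal{E}_r)$ For any $r \ge r_0$ there exists a constant $g(r) > 0$ such that

$$\sup_{\upsilon \in \Upsilon_\circ(r)}\ \sup_{\mu \le g(r)}\ \sup_{\gamma \in \mathbb{R}^{p^*}}\ \log\mathbb{E}\exp\Big\{\mu\,\frac{\langle\nabla\zeta(\upsilon), \gamma\rangle}{\|\mathcal{D}\gamma\|}\Big\} \le \frac{\nu_r^2\mu^2}{2}.$$

We impose one further merely technical condition:

$(B_1)$ We assume for all $r \ge \frac{6\nu_0}{b}\sqrt{x + 4p^*}$

$$1 + \sqrt{x + 4p^*} \le \frac{3\nu_r^2}{b}\,g(r).$$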

Remark 2.3. Without this condition the calculation of $R_0(x)$ in Section 4.1 would become technically more involved, without any further insight being gained.

Remark 2.4. For a discussion on how restrictive these conditions are we refer the reader to Remarks 2.8 and 2.9 of Andresen and Spokoiny (2013).


2.2 Introduction of important objects 

In this section we introduce all objects and bounds that are relevant for Theorem 2.2. 
This section is quite technical but necessary to understand the results. 

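First consider the $p^* \times p^*$ matrices $\mathcal{D}^2$ and $\mathcal{V}^2$ from Section 2.1, which could be defined similarly to the Fisher information matrix:

$$\mathcal{D}^2 \stackrel{\mathrm{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*), \qquad \mathcal{V}^2 \stackrel{\mathrm{def}}{=} \operatorname{Cov}(\nabla\mathcal{L}(\upsilon^*)).$$

We represent the information and covariance matrix in block form:

$$\mathcal{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix}, \qquad \mathcal{V}^2 = \begin{pmatrix} V^2 & E \\ E^\top & Q^2 \end{pmatrix}.$$

A crucial object is the constant $0 \le \rho$ defined by

$$\|D^{-1}AH^{-1}\|^2 \stackrel{\mathrm{def}}{=} \rho,$$

which we assume to be smaller than 1 ($\|\cdot\|$ here and everywhere denotes the spectral norm when its argument is a matrix). It determines the speed of convergence of the alternating procedure (see Theorem 2.2). Define also the local sets

$$\Upsilon_\circ(r) \stackrel{\mathrm{def}}{=} \{\upsilon:\ (\upsilon - \upsilon^*)^\top\mathcal{D}^2(\upsilon - \upsilon^*) \le r^2\},$$

$$\tilde\Upsilon_\circ(r) \stackrel{\mathrm{def}}{=} \{\upsilon:\ (\upsilon - \tilde\upsilon)^\top\mathcal{D}^2(\upsilon - \tilde\upsilon) \le r^2\},$$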
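and the radius $r_0 > 0$ via

$$r_0(x) \stackrel{\mathrm{def}}{=} \inf_{r > 0}\Big\{\mathbb{P}\Big(\operatorname*{argmax}_{\upsilon \in \Upsilon,\ \Pi_\theta\upsilon = \theta^*}\mathcal{L}(\upsilon),\ \tilde\upsilon \in \Upsilon_\circ(r)\Big) \ge 1 - e^{-x}\Big\}. \tag{2.3}$$

Remark 2.5. This radius can be determined using conditions $(\mathcal{L}_r)$ and $(\mathcal{E}_r)$ of Section 2.1 and Theorem 4.3, which would yield $r_0(x) = C\sqrt{x + p^*}$.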

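Further introduce the $p \times p$ matrix $\breve{D}$ and the $p$-vectors $\breve\nabla_\theta$ and $\breve\xi$ as

$$\breve{D}^2 = D^2 - AH^{-2}A^\top, \qquad \breve\nabla_\theta = \nabla_\theta - AH^{-2}\nabla_\eta, \qquad \breve\xi = \breve{D}^{-1}\breve\nabla_\theta,$$

and the matrices

$$\mathbb{B}^2 \stackrel{\mathrm{def}}{=} \mathcal{D}^{-1}\mathcal{V}^2\mathcal{D}^{-1}, \qquad B_\theta \stackrel{\mathrm{def}}{=} D^{-1}V^2D^{-1}, \qquad B_\eta \stackrel{\mathrm{def}}{=} H^{-1}Q^2H^{-1}.$$

Remark 2.6. The random variable $\breve\xi \in \mathbb{R}^p$ is related to the efficient influence function in semiparametric models. If the model is regular and correctly specified, $\breve{D}^2$ is the covariance of the efficient influence function and its inverse is the semiparametric Cramér-Rao lower bound for regular estimators. The matrices $\mathbb{B}$, $B_\theta$, $B_\eta$ describe the misspecification of the model and are related to the White statistic.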

For our estimations we need the constant

$$\mathfrak{z}(x) \stackrel{\mathrm{def}}{=} \mathfrak{z}(x, \mathbb{B}) \vee \mathfrak{z}_Q(x, 4p^*) \approx \sqrt{p^* + x},$$

where $\mathfrak{z}(x, \cdot)$ is explained in Section 7 and $\mathfrak{z}_Q(x, \cdot)$ is defined in Equation (8.2).

Remark 2.7. The constant $\mathfrak{z}(x)$ is only introduced for ease of notation. This makes some bounds less sharp but allows us to address all terms that are of order $\sqrt{p^* + x}$ with one symbol. The constant $\mathfrak{z}(x, \mathbb{B})$ is comparable to the "$1 - e^{-x}$" quantile of the norm of $\mathcal{D}^{-1}\mathcal{V}X$, where $X \sim \mathcal{N}(0, \mathrm{Id}_{p^*})$, i.e. it is of the order of the trace of $\mathbb{B}$. The constant $\mathfrak{z}_Q(x, Q)$ arises as an exponential deviation bound for the supremum of a smooth process over a set with complexity described by $Q$.

To bound the deviations of the points of the sequence $(\tilde\upsilon_{k,k(+1)})$ we need the following radius:

R 0 (x,K 0 ) ( = 3 (x) V ^x + 2Ap* + ^K 0 (x), (2.4) 

which will ensure $\{\tilde\upsilon_0, \tilde\upsilon_{0,1}, \ldots\} \subset \Upsilon_\circ(R_0)$, where $K_0(x) > 0$ is defined as

$$K_0(x) \stackrel{\mathrm{def}}{=} \inf\big\{K > 0:\ \mathbb{P}\big(\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K\big) \ge 1 - \beta(x)\big\},$$

for some $\beta(x) \to 0$ as $x \to \infty$, see condition $(A_1)$ in Section 2.3. Finally define the parametric uniform spread and the semiparametric uniform spread

0 <?(r,x) {<5(r)r+ 6 i/iw( 3 Q (x, V)+ 2r 2 )} , 

<>Q(r,x) = f — 8 ^ 2 5( r ) r + ( 3 Q (x, 2 p* +2pf + 2r 2 ) . (2.5) 

Remark 2.8. This object is central to our analysis as it describes the accuracy of our main result, Theorem 2.2. It is small for not too large $r$ if $\breve\omega, \breve\delta$ from conditions $(\breve{\mathcal{ED}}_1)$, $(\breve{\mathcal{L}}_0)$ of Section 2.1 are small (with Lemma 2.1 it suffices that $\omega, \delta$ from $(\mathcal{ED}_1)$, $(\mathcal{L}_0)$ are small). $\breve\diamondsuit_Q(r, x)$ is structurally slightly different from $\breve\diamondsuit(r, x)$ in Andresen and Spokoiny (2013), as it is based on Theorem 8.2 and allows a "uniform in $k$" formulation of our main result Theorem 2.2, but for moderate $x \in \mathbb{R}_+$ they are of similar size.

2.3 Dependence on initial guess 

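Our main theorem is only valid under the conditions from Section 2.1 and under some constraints on the quality of the initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p^*}$, which we denote by $(A_1)$, $(A_2)$ and $(A_3)$:

$(A_1)$ With probability greater than $1 - \beta_{(A)}(x)$ the initial guess satisfies $\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K_0(x)$ for some $K_0(x) \ge 0$.

$(A_2)$ The conditions $(\breve{\mathcal{ED}}_1)$, $(\breve{\mathcal{L}}_0)$, $(\mathcal{ED}_1)$ and $(\mathcal{L}_0)$ from Section 2.1 hold for all $r \le R_0(x, K_0)$, where $R_0$ is defined in (2.4) with $\beta(x) = \beta_{(A)}(x)$.

$(A_3)$ There is some $\epsilon > 0$ such that $\delta(r)/r \vee 12\nu_0\omega \le \epsilon$ for all $r \le R_0$. Further $K_0(x) \in \mathbb{R}$ and $\epsilon > 0$ are small enough to ensure

$$c(\epsilon, \mathfrak{z}(x)) \stackrel{\mathrm{def}}{=} \epsilon\,7C(\rho)\frac{1}{1 - \rho}\big(\mathfrak{z}(x) + \epsilon\,\mathfrak{z}(x)^2\big) < 1, \tag{2.6}$$

$$c(\epsilon, R_0) \stackrel{\mathrm{def}}{=} \epsilon\,7C(\rho)\frac{1}{1 - \rho}R_0 < 1, \tag{2.7}$$

$$C(\rho) \stackrel{\mathrm{def}}{=} 2\sqrt{2}\,(1 + \sqrt{\rho})(1 - \sqrt{\rho})^{-1}. \tag{2.8}$$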
Remark 2.9. One way of obtaining condition $(A_1)$ is to show that $\tilde\upsilon_0 \in \Upsilon_\circ(R_K)$ with probability greater than $1 - \beta_{(A)}(x)$ for some finite $R_K(x) \in \mathbb{R}$ and $0 \le \beta_{(A)}(x) < 1$. Then (see Section 4.1)

$$K_0(x) \stackrel{\mathrm{def}}{=} (1/2 + 12\nu_0\omega)R_K^2 + (\delta(R_K) + \mathfrak{z}(x))R_K + 6\nu_0\omega\,\mathfrak{z}(x)^2.$$

Condition $(A_1)$ is specified by conditions $(A_2)$ and $(A_3)$ and is fundamental, as it allows us with dominating probability to concentrate the analysis on a local set $\Upsilon_\circ(R_0(x))$ (see Theorem 4.3). Conditions $(A_2)$ and $(A_3)$ impose a bound on $R_0(x)$ and thus on $K_0$ from $(A_1)$. These conditions boil down to $\delta(R_0) + \omega R_0$ being significantly smaller than 1. Condition $(A_3)$ ensures that the quality of the main result from Andresen and Spokoiny (2013) can be attained, i.e. that $\breve\diamondsuit_Q(r_k, x) \approx \breve\diamondsuit(r_0, x)$ under rather mild conditions on the size of $R_0$, as we only need $\epsilon R_0$ to be small. A violation of $(A_2)$ would make it impossible to apply Theorem 8.1, the backbone of our proofs.

Remark 2.10. In the case of i.i.d. observations with sample size $n$ one often has $\delta(R_0) + \omega R_0 \le C R_0(x)/\sqrt{n}$, which suggests at first glance that $(A_2)$ and $(A_3)$ are only a question of the sample size. But note that in the case of i.i.d. observations the functional satisfies $-\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \sim n$, such that the conditions $(A_2)$ and $(A_3)$ are not satisfied automatically with sufficiently large sample size. They are true conditions on the quality of the first guess.

2.4 Statistical properties of the alternating sequence 

In this Section we present our main theorem in full rigor, i.e. that the limit of the 
alternating sequence satisfies a finite sample Wilks Theorem and Fisher expansion. 

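Theorem 2.2. Assume that the conditions $(\breve{\mathcal{ED}}_0)$, $(\breve{\mathcal{ED}}_1)$, $(\breve{\mathcal{L}}_0)$, $(\mathcal{L}_r)$ and $(\mathcal{E}_r)$ of Section 2.1 are met with a constant $b(r) \equiv b$ and where $\mathcal{V}^2 = \operatorname{Cov}(\nabla\mathcal{L}(\upsilon^*))$, $\mathcal{D}^2 = -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ and where $\upsilon^\circ = \upsilon^*$. Assume that $(\mathcal{ED}_1)$ and $(\mathcal{L}_0)$ are met. Further assume $(B_1)$ and that the initial guess satisfies $(A_1)$ and $(A_2)$ of Section 2.3. Then it holds with probability greater than $1 - 8e^{-x} - \beta_{(A)}$ for all $k \in \mathbb{N}$

$$\big\|\breve{D}(\tilde\theta_k - \theta^*) - \breve\xi\big\| \le \breve\diamondsuit_Q(r_k, x), \tag{2.9}$$

$$\big|2\breve{\mathcal{L}}(\tilde\theta_k, \theta^*) - \|\breve\xi\|^2\big| \le 8\big(\|\breve\xi\| + \breve\diamondsuit_Q(r_k, x)\big)\breve\diamondsuit_Q(2(1 + \rho)r_k, x) + \breve\diamondsuit_Q(r_k, x)^2, \tag{2.10}$$

where

$$r_k \le 2\sqrt{2}\,(1 - \sqrt{\rho})^{-1}\big\{(\mathfrak{z}(x) + \diamondsuit_Q(R_0, x)) + (1 + \sqrt{\rho})\rho^k R_0(x)\big\}.$$

If further condition $(A_3)$ is satisfied then (2.9) and (2.10) are met with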

r k < C (p) (a(x) + e 3 (x) 2 ) + ^ _ 7 c ( e ^( x )) (l^)) + €3(x)2 ) 

+/ (own, + q I c C ( ( £ rf flo) 


Rl 
In particular this means that if

$$k \ge \frac{2\log(\mathfrak{z}(x)) - \log\{2R_0(x, K_0)\}}{\log(\rho)}$$

we have with $\mathfrak{z}(x)^2 \le C_{\mathfrak{z}}(p^* + x)$

$$\breve\diamondsuit_Q(r_k, x) \approx \breve\diamondsuit_Q\big(C\sqrt{p^* + x}, x\big).$$

Remark 2.11. Note that the results are very similar to those in Andresen and Spokoiny (2013) for the profile M-estimator $\tilde\theta$. This is evident after noting that (ignoring terms of the order $\epsilon\,\mathfrak{z}(x)$)

$$r_k \le C(\rho)\big(\mathfrak{z}(x) + \rho^k(R_0 + C\epsilon R_0^2)\big),$$

which for large $k \in \mathbb{N}$ means $r_k \le C(\rho)\mathfrak{z}(x)$.
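The bound in Remark 2.11 splits the radius into a statistical part of order $\mathfrak{z}(x)$ and an optimization part that decays like $\rho^k$. A rough way to read off how many iterations are needed before the second part is dominated by the first is sketched below. The helper `burn_in_steps` and its inputs are hypothetical and ignore the constants $C(\rho)$ and $C\epsilon R_0^2$.

```python
def burn_in_steps(z, r0, rho, max_steps=100_000):
    """Smallest k with rho**k * r0 <= z, found by direct iteration.

    z   : target level, standing in for the statistical radius z(x)
    r0  : initial radius, standing in for R_0
    rho : contraction factor, must lie strictly between 0 and 1
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if z <= 0.0 or r0 <= 0.0:
        raise ValueError("z and r0 must be positive")
    k, term = 0, r0
    while term > z:
        term *= rho   # one more full alternation step
        k += 1
        if k > max_steps:
            raise RuntimeError("no convergence within max_steps")
    return k


steps = burn_in_steps(z=1.0, r0=100.0, rho=0.5)
```

The count grows only logarithmically in the ratio `r0 / z`, which is why a moderately poor initial guess costs few additional iterations once the conditions on the initial guess hold.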

Remark 2.12. Concerning the properties of $\breve\xi \in \mathbb{R}^p$ we repeat Remark 2.1 of Andresen and Spokoiny (2013). In the case of correct model specification the deviation properties of the quadratic form $\|\breve\xi\|^2 = \|\breve{D}^{-1}\breve\nabla_\theta\|^2$ are essentially the same as those of a chi-square random variable with $p$ degrees of freedom; see Theorem 7.1 in the appendix. In the case of a possible model misspecification the behavior of the quadratic form $\|\breve\xi\|^2$ will depend on the characteristics of the matrix $\breve{\mathbb{B}} \stackrel{\mathrm{def}}{=} \breve{D}^{-1}\operatorname{Cov}(\breve\nabla_\theta\zeta(\upsilon^*))\breve{D}^{-1}$; see again Theorem 7.1. Moreover, in the asymptotic setup the vector $\breve\xi$ is asymptotically standard normal; see Section 2.2 of Andresen and Spokoiny (2013) for the i.i.d. case.

Remark 2.13. These results allow us to derive some important corollaries like concentration and confidence sets (see Spokoiny (2012), Section 3.2).

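Remark 2.14. In general an exact numerical computation of

$$\tilde\theta(\eta) \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta \in \mathbb{R}^p}\mathcal{L}(\theta, \eta), \quad \text{or} \quad \tilde\eta(\theta) \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\eta \in \mathbb{R}^m}\mathcal{L}(\theta, \eta),$$

is not possible. Define $\hat\theta(\eta)$ and $\hat\eta(\theta)$ as the numerical approximations to $\tilde\theta(\eta)$ and $\tilde\eta(\theta)$ and assume that

$$\|D(\hat\theta(\eta) - \tilde\theta(\eta))\| \le \tau, \quad \text{for all } \eta \in \Upsilon_{\circ,\eta}(R_0) \stackrel{\mathrm{def}}{=} \{\upsilon \in \Upsilon_\circ(R_0),\ \Pi_\eta\upsilon = \eta\},$$

$$\|H(\hat\eta(\theta) - \tilde\eta(\theta))\| \le \tau, \quad \text{for all } \theta \in \Upsilon_{\circ,\theta}(R_0) \stackrel{\mathrm{def}}{=} \{\upsilon \in \Upsilon_\circ(R_0),\ \Pi_\theta\upsilon = \theta\}.$$

Then we can easily modify the proof of Theorem 2.2 by adding $C(\rho)\tau$ to the error terms and the radii $r_k$, where $C(\rho)$ is some rational function of $\rho$.

Remark 2.15. Note that under condition $(A_3)$ the size of $r_k$ for $k \to \infty$ does not depend on $R_0 > 0$. So as long as $\epsilon R_0$ is small enough, the quality of the initial guess no longer affects the statistical properties of the sequence $(\tilde\theta_k)$ for large $k \in \mathbb{N}$.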
2.5 Convergence to the ME

Even though Theorem 2.2 tells us that the statistical properties of the alternating sequence resemble those of its target, the profile ME, it is an interesting question whether the underlying approach allows us to identify conditions under which the sequence actually attains the maximizer $\tilde\upsilon$. Without further assumptions Theorem 2.2 yields the following corollary:

Corollary 2.3. Under the assumptions of Theorem 2.2 it holds with probability greater than $1 - 8e^{-x} - \beta_{(A)}$

$$\|\breve{D}(\tilde\theta - \tilde\theta_k)\| \le \breve\diamondsuit_Q(r_k, x) + \breve\diamondsuit(r_0, x),$$

where $r_0 > 0$ is defined in (2.3) and

<>( r ,x) = f T —^- 2 - 2 < 5 (r)r + 6 viW 3 i(x ,2 p* +2p)r. 

(1 - p z y 

Remark 2.16. The value $\mathfrak{z}_1(x, \cdot)$ is defined in (2.11).

Corollary 2.3 is a first step in the direction of an actual convergence result, but the gap $\breve\diamondsuit_Q(r_k, x) + \breve\diamondsuit(r_0, x)$ is not a zero sequence in $k \in \mathbb{N}$. It turns out that it is possible to prove convergence to the ME at the cost of assuming more smoothness of the functional $\mathcal{L}$ and using the right bound for the maximal eigenvalue of the Hessian $\nabla^2\mathcal{L}(\upsilon^*)$.

Consider the following condition, that basically quantifies how ’’well behaved” the 
second derivative V 2 (£ — 1EL) is: 


(£© 2 ) There exists a constant oj < 1/2, such that for all \p,\ < g and all 0 < r < ro 


sup sup sup log IE exp 
v,v'£To(r) ||tt 11=1 II 72 IH 1 


hljv X { v2 C(^)-V 2 C(^)}72 l < u lh 2 

u 2 \\V{v - v')\\ ) 2 


Define ^(x, V 2 £(v*)) via 

JP {||© _1 V 2 £(t>*)|| > 3 (x, V 2 £(t>*))} < e _x , 

and x(x, Ro) 

x(x,R 0 ) = [<5(Ro) + 9cu 2 u 2 ||R” 1 || 3 l (x,6p*)R 0 + H©- 1 ^ (x,V 2 T(r/))] , 

where $\mathfrak{z}_1(x, \cdot)$ satisfies (see Theorem 9.2)

$$\mathfrak{z}_1(x, Q) := \begin{cases} \sqrt{2(x + Q)} & \text{if } \sqrt{2(x + Q)} \le g_0, \\ g_0^{-1}(x + Q) + g_0/2 & \text{otherwise.} \end{cases} \tag{2.11}$$
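As a small numerical illustration, the two-regime bound (2.11) can be evaluated as follows. This is only a sketch: the function name `z1` and the choice to pass $g_0$ as a plain argument are assumptions made for the example, and the formula is used in the form $\sqrt{2(x+Q)}$ for small $x + Q$ and $g_0^{-1}(x+Q) + g_0/2$ otherwise.

```python
import math


def z1(x: float, q: float, g0: float) -> float:
    """Evaluate the bound (2.11) for deviation level x and entropy q."""
    s = math.sqrt(2.0 * (x + q))
    if s <= g0:
        return s                      # square-root regime for small x + q
    return (x + q) / g0 + g0 / 2.0    # linear regime for large x + q


# At the switching point sqrt(2 (x + q)) = g0 both branches give g0.
at_switch = z1(1.0, 1.0, 2.0)
in_linear_regime = z1(3.0, 5.0, 2.0)
```

With these inputs the first call sits exactly at the switching point, where both branches agree, so the bound is continuous in $x + Q$.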
Remark 2.17. For the case that $\mathcal{L}(\upsilon) = \sum_{i=1}^{n} \ell_i(\upsilon)$ with a sum of independent marginal
functionals $\ell_i : \Upsilon \to \mathbb{R}$ we can use Corollary 3.7 of Tropp (2012) to obtain


3 (x,V 2 £(«*)) = ^3^+7, 


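if with a sequence of matrices $(A_i) \subset \mathbb{R}^{p^* \times p^*}$

$$\log \mathbb{E} \exp\{\lambda \nabla^2 \ell_i(\upsilon^*)\} \preceq \nu^2 \lambda^2 / 2 \, A_i, \qquad \Big\|\sum_{i=1}^{n} A_i\Big\| \le \tau.$$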

Remark 2.18. In the case of smooth i.i.d. models this means that $\varkappa(x, R_0) \le C(R_0 + x +
\log(p^*))/\sqrt{n} + C R_0 \sqrt{x + p^*}/n$. Hence $\varkappa(x, R_0) = O((x + R_0 + \log(p^*))/\sqrt{n})$ if
$p^* + x = o(n)$.

With these definitions we can prove the following theorem:


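Theorem 2.4. Let the conditions $(\mathcal{E}\mathcal{D}_2)$, $(\mathcal{L}_0)$, $(\mathcal{L}_r)$ and $(\mathcal{E}_r)$ be met with a constant
$b(r) \equiv b$ and where $\mathcal{D}^2 = -\nabla^2 \mathbb{E}\mathcal{L}(\upsilon^*)$ and $\upsilon^* = \upsilon^\circ$. Further suppose $(B_1)$ and that
the initial guess satisfies $(A_1)$ and $(A_2)$. Assume that $\varkappa(x, R_0) < (1 - \rho)$. Then

$$\mathbb{P}\Big(\bigcap_{k \in \mathbb{N}} \{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^*)\}\Big) \ge 1 - 3e^{-x} - \beta(A),$$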


where 


A < 


P k2 V2 l-„(l,R Q )k R 0 . x ( x ’ R o) k < h 


) 1 -p 

J x(x,ijo) 


r ( x )fc/ lo g(fc)^ 0j otherwise, 


def 


with Rq — Rq + ro and 


Tll) * (atm™ <i 


L{k) = 


1-P J 

log (1/p) - i (log(2\/2) - log(x(x, Ro)k - 1)) 


€ N, 


( 2 . 12 ) 


where $\lfloor x \rfloor \in \mathbb{N}$ denotes the largest natural number smaller than $x > 0$.


Remark 2.19. This means that we obtain nearly linear convergence to the global
maximizer $\tilde\upsilon$.


Remark 2.20. As in Remark 2.14, if no exact numerical computation of the stepwise
maximizers is possible, we can easily modify the proof of Theorem 2.4 by adding $C(\rho)\tau$
to $\varkappa(x, R_0)$ to address that case.
2.6 Critical dimension

In parallel to Andresen and Spokoiny (2013) we want to address the issue of critical
parameter dimensions when the full dimension $p^*$ grows with the sample size $n$. We
write $p^* = p_n$. The results of Theorem 2.2 are accurate if the spread function $\diamondsuit_Q(r_k, x)$
from (2.5) is small. The critical size of $p^*$ then depends on the exact bounds on $\delta(\cdot)$ and
$\omega$. In the i.i.d. setting $\delta(r)/r \asymp \omega \asymp 1/\sqrt{n}$ such that $\diamondsuit(r_k, x) \asymp p^*/\sqrt{n}$ for large $k \in \mathbb{N}$.
In other words, one needs that "$p^{*2}/n$ is small" to obtain an accurate non-asymptotic
version of the Wilks phenomenon and the Fisher Theorem for the limit of the alternating
sequence. This is not surprising because good performance of the ME itself can only be
guaranteed if "$p^{*2}/n$ is small", as is shown in Andresen and Spokoiny (2013). There are
examples where the pME only satisfies a Wilks or Fisher result if "$p^{*2}/n$ is small", such
that in any of those settings the alternating sequence started in the global maximizer
does not admit an accurate Wilks or Fisher expansion.

Interestingly, the constraint $\varkappa(x, R_0) < (1 - \rho)$ of Theorem 2.4 for the
convergence of the sequence to the global maximizer means that one needs $p^*/n \ll 1$ in the
smooth i.i.d. setting if $R_0 \le C r_0 \sqrt{p^* + x}$. Further, Theorem 2.4 states a lower bound
for the speed of convergence that in the smooth i.i.d. setting decreases if $p^*/n$ grows.
Unfortunately we were unable to find an example that meets the conditions of Section
2.1 and where no convergence occurs if $p^*/n$ tends to infinity. So whether this
dimension effect on the convergence is an artifact of our proofs or indeed a property of the
alternating procedure remains an open question.


3 Application to single index model 

We illustrate how the results of Theorem 2.2 and Theorem 2.4 can be applied in single
index modeling. Consider the following model

$$Y_i = f(X_i^\top \theta^*) + \varepsilon_i, \qquad i = 1, \ldots, n,$$

for some $f : \mathbb{R} \to \mathbb{R}$ and $\theta^* \in S_1^{p,+} \subset \mathbb{R}^p$ and with i.i.d. errors $\varepsilon_i \in \mathbb{R}$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$
and i.i.d. random variables $X_i \in \mathbb{R}^p$ with distribution denoted by $\mathbb{P}^X$. The single-index
model is widely applied in statistics. For example, in econometric studies it serves as a
compromise between too restrictive parametric models and flexible but hardly estimable
purely nonparametric models. Usually the statistical inference focuses on estimating the
index vector $\theta^*$. A lot of research has already been done in this field. For instance,
Delecroix et al. (1997) show the asymptotic efficiency of the general semiparametric
maximum-functional estimator for particular examples, and in Haerdle et al. (1993) the
right choice of bandwidth for the nonparametric estimation of the link function is
analyzed.

To ensure identifiability of $\theta^* \in \mathbb{R}^p$ we assume that it lies in the half sphere $S_1^{p,+} :=
\{\theta \in \mathbb{R}^p : \|\theta\| = 1, \ \theta_1 > 0\} \subset \mathbb{R}^p$. For simplicity we assume that the support of
the $X_i \in \mathbb{R}^p$ is contained in the ball of radius $s_X > 0$. This allows us to approximate
$f \in \{f : [-s_X, s_X] \to \mathbb{R}\}$ by an orthonormal $C^2$-Daubechies-wavelet basis, i.e. for a
suitable function $\psi : [-s_X, s_X] \to \mathbb{R}$ we set for $k = (2^{j_k} - 1)13 + r_k$ with $j_k \in \mathbb{N}_0$
and $r_k \in \{0, \ldots, (2^{j_k})13 - 1\}$

$$e_k(t) := 2^{j_k/2} \psi\left(2^{j_k}(t - 2 r_k s_X)\right), \qquad k \in \mathbb{N}.$$


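A candidate to estimate $\theta^*$ is the profile ME

$$\tilde\theta_m := \Pi_\theta \operatorname*{argmax}_{(\theta, \eta) \in \Upsilon_m} \mathcal{L}_m(\theta, \eta),$$

where

$$\mathcal{L}_m(\theta, \eta) = -\frac{1}{2} \sum_{i=1}^{n} \Big| Y_i - \sum_{k=0}^{m} \eta_k e_k(X_i^\top \theta) \Big|^2$$

and where $\Upsilon_m \subset S_1^{p,+} \times B_{r^\circ}^m \subset \mathbb{R}^p \times \mathbb{R}^m$, where $B_{r^\circ}^m \subset \mathbb{R}^m$ denotes the centered ball of
radius $r^\circ > 0$ for some $r^\circ > 0$. Ichimura (1993) analyzed a very similar estimator in a
more general setting based on a kernel estimation of $\mathbb{E}[Y \mid f(\theta^\top X)]$ instead of using a
parametric sieve approximation $\sum_k \eta_k e_k$. He showed $\sqrt{n}$-consistency and asymptotic
normality of the proposed estimator.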

In this setting a direct computation of $\tilde\upsilon$ becomes involved, as the maximization
problem is high dimensional and not convex. But as noted in the introduction, the
maximization with respect to $\eta$ for given $\theta$ is high dimensional but convex and consequently
feasible. Further, for moderate $p \in \mathbb{N}$ the maximization with respect to $\theta$ for fixed $\eta$
is computationally realistic. So an alternating maximization procedure is applicable. To
show that it behaves in a desired way we apply the technique presented above.
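The alternation just described can be sketched in a few lines for $p = 2$. The following is only an illustrative toy, not the estimator analyzed in this paper: it assumes a monomial basis in place of the Daubechies wavelets, parametrizes the half circle by an angle `phi`, uses a local grid search for the $\theta$-step, and all function names and data-generating choices are hypothetical.

```python
import math
import random


def basis(t, m):
    # Hypothetical stand-in for the wavelet basis: monomials 1, t, ..., t^(m-1).
    return [t ** k for k in range(m)]


def solve(a, b):
    # Gaussian elimination with partial pivoting for a small system a z = b.
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def index(x, phi):
    # x^T theta with theta = (cos(phi), sin(phi)); |phi| < pi/2 keeps theta_1 > 0.
    return x[0] * math.cos(phi) + x[1] * math.sin(phi)


def objective(phi, eta, xs, ys):
    # L(theta, eta) = -1/2 * sum_i (y_i - sum_k eta_k e_k(x_i^T theta))^2
    total = 0.0
    for x, y in zip(xs, ys):
        e = basis(index(x, phi), len(eta))
        total += (y - sum(h * v for h, v in zip(eta, e))) ** 2
    return -0.5 * total


def eta_step(phi, xs, ys, m):
    # Convex step: least squares in eta for fixed theta via normal equations.
    gram = [[0.0] * m for _ in range(m)]
    rhs = [0.0] * m
    for x, y in zip(xs, ys):
        e = basis(index(x, phi), m)
        for a in range(m):
            rhs[a] += y * e[a]
            for b in range(m):
                gram[a][b] += e[a] * e[b]
    return solve(gram, rhs)


def theta_step(phi, eta, xs, ys, width=0.5, points=41):
    # Low-dimensional step: local grid search that always keeps the current angle.
    cands = [phi] + [phi + width * (2 * j / (points - 1) - 1) for j in range(points)]
    cands = [c for c in cands if -math.pi / 2 < c < math.pi / 2]
    return max(cands, key=lambda c: objective(c, eta, xs, ys))


def alternate(xs, ys, phi0, m=3, steps=10):
    phi = phi0
    eta = eta_step(phi, xs, ys, m)
    values = [objective(phi, eta, xs, ys)]
    for _ in range(steps):
        phi = theta_step(phi, eta, xs, ys)
        eta = eta_step(phi, xs, ys, m)
        values.append(objective(phi, eta, xs, ys))
    return phi, eta, values


# Synthetic data with a quadratic link and a hypothetical true angle of 0.4.
random.seed(0)
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
ys = [index(x, 0.4) ** 2 + 0.5 * index(x, 0.4) + random.gauss(0, 0.05) for x in xs]
phi_hat, eta_hat, values = alternate(xs, ys, phi0=0.0)
```

Because each half-step maximizes the functional over one block while the other is held fixed, the recorded values of the functional are non-decreasing along the iteration (up to rounding), which is the basic property the analysis above builds on.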

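For the initial guess $\tilde\upsilon_0 \in \Upsilon$ one can use a simple grid search. For this generate a
uniform grid $G_N := (\theta_1, \ldots, \theta_N) \subset S_1^{p,+}$ and define

$$\tilde\upsilon_0 := \operatorname*{argmax}_{(\theta, \eta) \in \Upsilon, \ \theta \in G_N} \mathcal{L}(\upsilon). \tag{3.1}$$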


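Note that given the grid the above maximizer is easily obtained. Simply calculate

$$\tilde\eta_{0,k} := \operatorname*{argmax}_{\eta} \mathcal{L}(\theta_k, \eta) = \Big(\frac{1}{n} \sum_{i=1}^{n} e e^\top(X_i^\top \theta_k)\Big)^{-1} \frac{1}{n} \sum_{i=1}^{n} Y_i \, e(X_i^\top \theta_k) \in \mathbb{R}^m, \tag{3.2}$$

where by abuse of notation $e = (e_1, \ldots, e_m) \in \mathbb{R}^m$. Now observe that

$$\tilde\upsilon_0 = \operatorname*{argmax}_{k = 1, \ldots, N} \mathcal{L}(\theta_k, \tilde\eta_{0,k}).$$

Define $\tau := \sup_{\theta, \theta^\circ \in G_N} \|\theta - \theta^\circ\|$.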

To apply the results presented in Theorem 2.2 and Theorem 2.4 we need a list of
assumptions denoted by $(\mathcal{A})$. We start with conditions on the regressors $X \in \mathbb{R}^p$:

$(\mathrm{Cond}_X)$ The measure $\mathbb{P}^X$ is absolutely continuous with respect to the Lebesgue
measure. The Lebesgue density $d_X : \mathbb{R}^p \to \mathbb{R}$ of $\mathbb{P}^X$ is only positive on the
ball $B_{s_X}(0) \subset \mathbb{R}^p$ and Lipschitz continuous on $B_{s_X}(0) \subset \mathbb{R}^p$ with Lipschitz
constant $L_{d_X} > 0$. Further we assume that for any $\theta \perp \theta^*$ with $\|\theta\| = 1$ we
have $\operatorname{Var}(X^\top \theta \mid X^\top \theta^*) > \sigma^2_{X|\theta^*}$ for some constant $\sigma^2_{X|\theta^*} > 0$ that does not
depend on $X^\top \theta^* \in \mathbb{R}$. Also the density $d_X : \mathbb{R}^p \to \mathbb{R}$ of the regressors satisfies
$c_{d_X} \le d_X \le C_{d_X}$ on $B_{s_X}(0) \subset \mathbb{R}^p$ for constants $0 < c_{d_X} \le C_{d_X} < \infty$.


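$(\mathrm{Cond}_f)$ For some $\eta^* \in l^2$

$$f = f_{\eta^*} = \sum_{k=1}^{\infty} \eta_k^* e_k,$$

where with some $\alpha > 2$ and a constant $C_{\|\eta^*\|} > 0$

$$\sum_{l=0}^{\infty} l^{2\alpha} \eta_l^{*2} \le C^2_{\|\eta^*\|} < \infty.$$

$(\mathrm{Cond}_{X\theta^*})$ It holds true that $\mathbb{P}(|f'_{\eta^*}(X^\top \theta^*)| > c_{f'}) > c_{\mathbb{P}f'}$ for some $c_{f'}, c_{\mathbb{P}f'} > 0$.

$(\mathrm{Cond}_\varepsilon)$ The errors $(\varepsilon_i) \in \mathbb{R}$ are i.i.d. with $\mathbb{E}[\varepsilon_i] = 0$, $\operatorname{Cov}(\varepsilon_i) = \sigma^2$ and satisfy for
all $|\mu| \le g$ for some $g > 0$ and some $\nu_r > 0$

$$\log \mathbb{E}[\exp\{\mu \varepsilon_i\}] \le \nu_r^2 \mu^2 / 2.$$

If these conditions denoted by $(\mathcal{A})$ are met we can prove the following results: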

Proposition 3.1. Let $\tau = o(p^{*-3/2})$ and $p^{*5}/n \to 0$. With initial guess given by
Equation (3.1) and for $x \le 2\nu_r^2 g^2 n$ the alternating sequence satisfies (2.9) and (2.10)
with probability greater than $1 - 9\exp\{-x\}$ and where with some constant $C_\diamondsuit \in \mathbb{R}$


❖g(r,x) < 


C o(p*+x) 3/2 , 2 




(r 2 +p* + x). 
Remark 3.1. The constraint $\tau = o(p^{*-3/2})$ implies that for the calculation of the initial
guess the vector $\tilde\eta_{0,k}$ of (3.2) and the functional $\mathcal{L}(\cdot)$ have to be evaluated $N = p^{*3(p-1)/2}$
times.
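To get a feeling for the cost of the grid search, the count $N = p^{*3(p-1)/2}$ can be evaluated for concrete dimensions. The sketch below is a plain arithmetic illustration; the helper name `grid_evaluations` and the example sizes are hypothetical, and since the constraint is an $o(\cdot)$ statement the value should be read as an order of magnitude rather than an exact count.

```python
import math


def grid_evaluations(p_star: int, p: int) -> int:
    """Order of the number of grid points N = p*^(3 (p - 1) / 2)."""
    return math.ceil(p_star ** (1.5 * (p - 1)))


# Hypothetical sizes: p = 3 index coordinates, m = 7 basis coefficients, p* = 10.
n_grid = grid_evaluations(p_star=10, p=3)
```

For these sizes the grid already has about a thousand points, and the count grows exponentially in $p$, which is why the grid search is only realistic for moderate $p$.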

Proposition 3.2. Take the initial guess given by Equation (3.1). Assume $(\mathcal{A})$ but use
a three times continuously differentiable wavelet basis. Further assume that $p^{*4}/n \to 0$
and $\tau = o(p^{*-3/2})$. Let $x > 0$ be chosen such that

x < \ {v 2 ng 2 - log (p*)) A p*. 

Then we get the claim of Theorem 2.4 with $\beta(A) = e^{-x}$ and

x(x, Rq) = 0 (rm 3y/2 + i/Txm 3 / 2 /n 1 / 4 ) + 0(p* 2 /y/n) -A- 0 , 
for moderate choice of x > 0 . 

For details see Andresen (2014). 


4 Proof of Theorem 2.2 

In this section we will prove Theorem 2.2. Before we start with the actual proof we want
to explain the agenda. The first step of the proof is to find a desirable set $\Omega(x) \subset \Omega$ of
high probability, on which a linear approximation of the gradient of the functional $\mathcal{L}(\upsilon)$
can be carried out with sufficient accuracy. Once this set is found all subsequent analysis
concerns events in $\Omega(x) \subset \Omega$.

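For this purpose define for some $K \in \mathbb{N}$ the set

$$\Omega(x) := \bigcap_{k=0}^{K} (C_{k,k} \cap C_{k,k+1}) \cap C(\nabla) \cap \{\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K_0(x)\}, \tag{4.1}$$

where

$$C_{k,k(+1)} := \left\{\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le R_0(x), \ \|D(\tilde\theta_k - \theta^*)\| \le R_0(x), \ \|H(\tilde\eta_{k(+1)} - \eta^*)\| \le R_0(x)\right\},$$

$$C(\nabla) := \bigcap_{r \le R_0(x)} \Big\{\sup_{\upsilon \in \Upsilon_\circ(r)} \Big\{\frac{1}{6\nu_1\omega}\|\mathcal{Y}(\upsilon)\| - 2r^2\Big\} \le \mathfrak{z}_Q(x, 4p^*)^2\Big\}$$
$$\qquad \cap \bigcap_{r \le R_0(x)} \Big\{\sup_{\upsilon \in \Upsilon_\circ(r)} \Big\{\frac{1}{6\nu_1\omega}\|\breve{\mathcal{Y}}(\upsilon)\| - 2r^2\Big\} \le \mathfrak{z}_Q(x, 2p^* + 2p)^2\Big\}$$
$$\qquad \cap \left\{\max\{\|\mathcal{D}^{-1}\nabla\mathcal{L}(\upsilon^*)\|, \|D^{-1}\nabla_\theta\mathcal{L}(\upsilon^*)\|, \|H^{-1}\nabla_\eta\mathcal{L}(\upsilon^*)\|\} \le \mathfrak{z}(x)\right\}$$
$$\qquad \cap \{\tilde\upsilon, \tilde\upsilon_{\theta^*} \in \Upsilon_\circ(r_0(x))\}.$$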
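For $\zeta(\upsilon) = \mathcal{L}(\upsilon) - \mathbb{E}\mathcal{L}(\upsilon)$ the semiparametric normalized stochastic gradient gap is
defined as

$$\breve{\mathcal{Y}}(\upsilon) := \breve{D}^{-1}\left(\breve\nabla_\theta \zeta(\upsilon) - \breve\nabla_\theta \zeta(\upsilon^*)\right),$$

the parametric normalized stochastic gradient gap $\mathcal{Y}(\upsilon)$ is defined as

$$\mathcal{Y}(\upsilon) := \mathcal{D}^{-1}\left(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon^*)\right),$$

and $r_0(x) > 0$ is chosen such that $\mathbb{P}(\tilde\upsilon, \tilde\upsilon_{\theta^*} \in \Upsilon_\circ(r_0)) \ge 1 - e^{-x}$, where

$$\tilde\upsilon_{\theta^*} := \operatorname*{argmax}_{\upsilon \in \Upsilon, \ \Pi_\theta \upsilon = \theta^*} \mathcal{L}(\upsilon).$$

Remark 4.1. We intersect the set with the event $\{\tilde\upsilon, \tilde\upsilon_{\theta^*} \in \Upsilon_\circ(r_0)\}$ where we a priori
demand $r_0(x) > 0$ to be chosen such that $\mathbb{P}(\tilde\upsilon, \tilde\upsilon_{\theta^*} \in \Upsilon_\circ(r_0)) \ge 1 - e^{-x}$. Note that
condition $(\mathcal{L}_r)$ together with $(\mathcal{E}_r)$ allows us to set $\sqrt{p^* + x} \approx r_0 \le R_0$ (see Theorem 4.3).

In Section 4.1 we show that this set is of probability greater than $1 - 8e^{-x} - \beta(A)$. We
want to explain the purpose of this set along the architecture of the proof of our main
theorem.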

$\{\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K_0(x)\}$: This set ensures that the first guess satisfies $\mathcal{L}(\tilde\upsilon_0, \upsilon^*)
\ge -K_0(x)$, which means that it is close enough to the target $\upsilon^* \in \mathbb{R}^{p^*}$. This fact
allows us to obtain an a priori bound for the deviation of the sequence $(\tilde\upsilon_{k,k(+1)}) \subset
\Upsilon$ from $\upsilon^* \in \Upsilon_\circ(R_0)$ with Theorem 4.3.

$\{\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le R_0(x)\}$: As just mentioned, this event is of high probability due to
$\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K_0(x)$ and Theorem 4.3. This allows us to concentrate the analysis on
the set $\Upsilon_\circ(R_0)$ on which Taylor expansions of the functional $\mathcal{L} : \mathbb{R}^{p^*} \to \mathbb{R}$ become
accurate.

$C(\nabla)$: This set ensures that on $\Omega(x) \subset \Omega$ all occurring random quadratic forms and
stochastic errors are controlled by $\mathfrak{z}(x) \in \mathbb{R}$. Consequently we can derive in the
proof of Lemma 4.5 an a priori bound of the form $\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le r_k$ for a
decreasing sequence of radii $(r_k) \subset \mathbb{R}_+$ satisfying $\limsup_{k \to \infty} r_k = C\mathfrak{z}(x)$. Further,
this set allows us to obtain in Lemma 4.7 the bounds for all $k \in \mathbb{N}$.

On $\Omega(x) \subset \Omega$ we find $\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k)$ such that we can follow the arguments of
Theorem 2.2 of Andresen and Spokoiny (2013) to obtain the desired result with accuracy
measured by $\diamondsuit_Q(r_k, x)$.
4.1 Probability of desirable set

Here we show that the set $\Omega(x)$ actually is of probability greater than $1 - 8e^{-x} - \beta(A)$. We
prove the following two lemmas, which together yield the claim.

Lemma 4.1. The set $C(\nabla)$ satisfies

$$\mathbb{P}(C(\nabla)) \ge 1 - 7e^{-x}.$$

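Proof. The proof is similar to the proof of Theorem 3.1 in Spokoiny (2012). Denote

$$A := \bigcap_{r \le R_0(x)} \Big\{\sup_{\upsilon \in \Upsilon_\circ(r)} \Big\{\frac{1}{6\nu_1\omega}\|\mathcal{Y}(\upsilon)\| - 2r^2\Big\} \le \mathfrak{z}_Q(x, 4p^*)^2\Big\},$$

$$B := \bigcap_{r \le R_0(x)} \Big\{\sup_{\upsilon \in \Upsilon_\circ(r)} \Big\{\frac{1}{6\nu_1\omega}\|\breve{\mathcal{Y}}(\upsilon)\| - 2r^2\Big\} \le \mathfrak{z}_Q(x, 2p^* + 2p)^2\Big\},$$

$$C := \left\{\max\{\|\mathcal{D}^{-1}\nabla\mathcal{L}(\upsilon^*)\|, \|D^{-1}\nabla_\theta\mathcal{L}(\upsilon^*)\|, \|H^{-1}\nabla_\eta\mathcal{L}(\upsilon^*)\|\} \le \mathfrak{z}(x)\right\}.$$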

We estimate 

P{C{V)) > 1 - P {A c ) - IP ( B c ) - P (C c ) 

-P(v,v e * ^T o (r o ))-1P(||D- I v 0 || 2 > 3 (x,® 0 )) • 

We bound using for both terms Theorem 8.2 which is applicable due to (£Di) and 

(£2b): 

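$$\mathbb{P}(A^c) \le e^{-x}, \qquad \mathbb{P}(B^c) \le e^{-x}.$$

For the set $C \subset \Omega$ observe that we can use $(\mathcal{I})$ and Lemma 4.2 to find

$$\|H^{-1}\nabla_\eta\| \vee \|D^{-1}\nabla_\theta\| \le \|\mathcal{D}^{-1}\nabla\|.$$

This implies that

$$\{\|\mathcal{D}^{-1}\nabla\| \le \mathfrak{z}(x, \mathbb{B})\} \subseteq \{\|D^{-1}\nabla_\theta\| \vee \|H^{-1}\nabla_\eta\| \le \mathfrak{z}(x, \mathbb{B})\}.$$

Using the deviation properties of quadratic forms as sketched in Section 7 we find

$$\mathbb{P}\left(\|\mathcal{D}^{-1}\nabla\| > \mathfrak{z}(x, \mathbb{B})\right) \le 2e^{-x}, \qquad \mathbb{P}\left(\|\breve{D}^{-1}\breve\nabla\| > \mathfrak{z}(x, \breve{\mathbb{B}})\right) \le 2e^{-x}.$$

By the choice of $\mathfrak{z}(x) > 0$ and $r_0 > 0$ this gives the claim.

□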
We cite Lemma B.2 of Andresen and Spokoiny (2013):

Lemma 4.2. Let

$$\mathcal{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix} \in \mathbb{R}^{(p+m) \times (p+m)}, \quad D \in \mathbb{R}^{p \times p}, \ H \in \mathbb{R}^{m \times m} \text{ invertible}, \quad \|D^{-1} A H^{-1}\| < 1.$$

Then for any $\upsilon = (\theta, \eta) \in \mathbb{R}^{p+m}$ we have $\|H^{-1}\eta\| \vee \|D^{-1}\theta\| \le \|\mathcal{D}^{-1}\upsilon\|$.

The next step is to show that the set $\bigcap_{k=1}^{K}(C_{k,k} \cap C_{k,k+1})$ has high probability that
is independent of the number of necessary steps. A close look at the proof of Theorem 4.1
of Spokoiny (2012) shows that it actually yields the following modified version:

Theorem 4.3 (Spokoiny (2012), Theorem 4.1). Suppose $(\mathcal{E}_r)$ and $(\mathcal{L}_r)$ with $b(r) \equiv b$.
Further define the following random set

$$\Upsilon(K) := \{\upsilon \in \Upsilon : \mathcal{L}(\upsilon, \upsilon^*) \ge -K\}.$$

If for a fixed $r_0$ and any $r \ge r_0$ the following conditions are fulfilled:

$$1 + \sqrt{x + 2p^*} \le 3\nu_r^2 g(r)/b$$



then 


IP(T(K) C T'o(ro)) > 1 — e -x . 


Note that with (X) 


II D(e k - r)|| v || H(fj k{+1) - r]*)\\ < -L-\\v(v kM+l) - u*)||. 


With assumption ( B \) and 



this implies the desired result as $\mathcal{L}(\tilde\upsilon_{k,k(+1)}, \upsilon^*) \ge \mathcal{L}(\tilde\upsilon_0, \upsilon^*)$ such that with Theorem
4.3


P ( P) (Ck,k n C'fe : fe + i) N j > ip ( P) {Ck,k n c k ,k+i) n {£(n 0 , v*) > — K 0 }'j 

\ fc =0 / U =0 / 

—]P(L(vo,v*) < —K 0 ) 

> IP {r(K 0 (x)) c r o ((l - p)Ro(x)) } - /3(A) 

> 1 - e_x - /3(A)- 

Remark 4.2. This also shows that the sets of maximizers $(\tilde\upsilon_{k,k(+1)})$ are nonempty
and well defined since the maximization always takes place on compact sets of the form
$\{\theta \in \mathbb{R}^p, (\theta, \tilde\eta) \in \Upsilon_\circ(R_0)\}$ or $\{\eta \in \mathbb{R}^m, (\tilde\theta, \eta) \in \Upsilon_\circ(R_0)\}$.

To address the claim of Remark 2.9 we present the following lemma:

Lemma 4.4. On the set $C(\nabla) \cap \{\tilde\upsilon_0 \in \Upsilon_\circ(R_K)\}$ it holds

$$\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -(1/2 + 12\nu_0\omega) R_K^2 - (\delta(R_K) + \mathfrak{z}(x)) R_K - 6\nu_0\omega \mathfrak{z}(x)^2.$$

Proof. With similar arguments as in the proof of Lemma 4.5 we have on C(V) C 17 that 

Mvo,v*) > JE[£(v 0 ,v*)] - ||D - 1 VC(u*)||i2jif - |{VC(«) - VC(n*)}(n 0 - v*)\ 

> -IPK - ^)|| 2 /2 - ||2)- 1 VC(^)||Rat 

— ||2)- 1 {V/C (v) - VL{v*)}\\Rk - Rk5(Rk) 

> -( 1/2 + 12 vqu)R 2 k - ( S(R K ) + 3 (x))R k - 6 z/ 0 u; 3 (x) 2 . 


□ 


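4.2 Proof of convergence

We derive the a priori bound $\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k)$ with an adequately decreasing sequence
$(r_k) \subset \mathbb{R}_+$ using the argument of Section 1.1, where $\limsup r_k \approx \mathfrak{z}(x)$.

Lemma 4.5. Assume that

$$\Omega(x) \subseteq \bigcap_{k \in \mathbb{N}_0} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^{(l)})\right\}.$$

Then under the assumptions of Theorem 2.2 we get on $\Omega(x)$ for all $k \in \mathbb{N}_0$

$$\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le 2\sqrt{2}(1 - \sqrt{\rho})^{-1}\left(\mathfrak{z}(x) + (1 + \sqrt{\rho})\rho^k R_0(x)\right) + 2\sqrt{2}(1 + \sqrt{\rho}) \sum_{r=0}^{k-1} \rho^r \diamondsuit_Q(r_{k-r}^{(l)}, x) =: r_k^{(l+1)}.$$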

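Proof. 1. We first show that on $\Omega(x)$

$$D(\tilde\theta_k - \theta^*) = D^{-1}\nabla_\theta \mathcal{L}(\upsilon^*) - D^{-1}A(\tilde\eta_k - \eta^*) + \tau(r_k^{(l)}), \tag{4.2}$$
$$H(\tilde\eta_k - \eta^*) = H^{-1}\nabla_\eta \mathcal{L}(\upsilon^*) - H^{-1}A^\top(\tilde\theta_{k-1} - \theta^*) + \tau(r_k^{(l)}),$$

where

$$\|\tau(r)\| \le \diamondsuit_Q(r, x) := \left\{\delta(r) r + 6\nu_1\omega\left(\mathfrak{z}_Q(x, 4p^*)^2 + 2r^2\right)\right\}.$$

The proof is the same in each step for both statements, such that we only prove the first
one. The arguments presented here are similar to those of Theorem D.1 in Andresen
and Spokoiny (2013). By assumption on $\Omega(x)$ we have $\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^{(l)})$. Define with
$\zeta = \mathcal{L} - \mathbb{E}\mathcal{L}$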


a{v,v*) := £>{v,v*) - (V((v*)(v - v*) - ||D(u - v*)\\ 2 /2) . 


Note that 


L(v,v*) = V({v*)(v-v*) - \\D{v — v*)\\ 2 /2 + a(v,v*) 

= \7 0 ((v*)(O - 0*) - || D{6 - 6*) || 2 /2 + (0 - 6 *) T A(r] - r)*) 
+V„C(t>*)fa - V*) - II H(rj - ^)|| 2 /2 + a(v,v*). 

Setting \7 0 L( 6 k ,rj k ) = 0 we find 

D(d k - 0*) - V-\V e t{v*) - A(fj k - v*)) = T>- 1 V 0 a(v k>k ,v*). 

As we assume that v kjk € T 0 (Ro) it suffices to show that with dominating probability 

sup ||'M0,»7 fc )|| < 0(4°), 

(0,ij fc )er o (R o ) 

where 


Ug(0, rj k ) = D-^Vetivk'k) - V 0 £(v*) — D 2 (6 - 6*) - A{rj k - V *)}. 
To see this note first that with Lemma 4.2 $\|D^{-1}\Pi_\theta \mathcal{D}\upsilon\| \le \|\mathcal{D}^{-1}\mathcal{D}\upsilon\|$. This gives by
condition $(\mathcal{L}_0)$, Lemma 4.2 and Taylor expansion


sup \\EU(0,rj k )\\ < sup \\D~ 1 IIq (vE!L(v) — VJEL(v*) — T) (v — v* 

(0,^fe)GTo(r) uSTo(r) 


< sup \\D- l n e T)\\\\T)- l \7 z IEL{vYT)- 1 - I } 

weT 0 (r) 

< <5(r)r. 


For the remainder note that again with Lemma 4.2 

||£> -1 (VeC(u) - VeC(^*)) | < II ® -1 (vC(v) — VC(v*)) 

This yields that on L?(x) 


|!/ 2 , 


sup 

( 0 >» 7 /fc)eT>(r) 

< sup 


U 0 {0,T] k ) - lEU 0 {e,rj k ) 
1 


< sup 


D- 


uSTo(r) 

fill < 6 vilu{$q(x, Ap*) + 2r 2 }. 


'(VeC^-VeC^*)) 


ueTo(r) l 6^1 U} 

Using the same argument for r] k gives the claim. 


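2. We prove the a priori bound for the distance of the $k$-th estimator to the oracle

$$\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le r_k^{(l+1)}.$$

To see this we first use the inequality

$$\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le \sqrt{2}\|D(\tilde\theta_k - \theta^*)\| + \sqrt{2}\|H(\tilde\eta_{k(+1)} - \eta^*)\|.$$

Now we find with (4.2)

$$\|D(\tilde\theta_k - \theta^*)\| \le \|D^{-1}\nabla_\theta \mathcal{L}(\upsilon^*)\| + \|D^{-1}A(\tilde\eta_k - \eta^*)\| + \|\tau(r_k^{(l)})\|$$
$$\le \|D^{-1}\nabla_\theta \mathcal{L}(\upsilon^*)\| + \|D^{-1}AH^{-1}\| \|H(\tilde\eta_k - \eta^*)\| + \|\tau(r_k^{(l)})\|.$$

Next we use that on $\Omega(x)$

$$\|D^{-1}AH^{-1}\| \le \sqrt{\rho}, \quad \|D^{-1}\nabla_\theta \mathcal{L}(\upsilon^*)\| \le \mathfrak{z}(x), \quad \|H^{-1}\nabla_\eta \mathcal{L}(\upsilon^*)\| \le \mathfrak{z}(x),$$
$$\|H(\tilde\eta_k - \eta^*)\| \le \|H^{-1}\nabla_\eta \mathcal{L}(\upsilon^*)\| + \|H^{-1}A^\top(\tilde\theta_{k-1} - \theta^*)\| + \|\tau(r_k^{(l)})\|$$

to derive the recursive formula

$$\|D(\tilde\theta_k - \theta^*)\| \le (1 + \sqrt{\rho})\left(\mathfrak{z}(x) + \|\tau(r_k^{(l)})\|\right) + \rho\|D(\tilde\theta_{k-1} - \theta^*)\|.$$

Deriving the analogous formula for $\|H(\tilde\eta_k - \eta^*)\|$ and solving the recursion gives the
claim.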

□ 
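The contraction behind the recursive formula is easiest to see in an exactly quadratic toy problem, where the remainder $\tau$ vanishes and the score terms only shift the maximizer. The sketch below is not part of the proof: it assumes scalar blocks with hypothetical values `d2`, `h2`, `a` playing the roles of $D^2$, $H^2$, $A$, and checks that the error of the $\eta$-iterates shrinks by the factor $\rho = \|D^{-1}AH^{-1}\|^2$ per full alternation cycle.

```python
def alternate_quadratic(d2, h2, a, g_theta, g_eta, eta0, steps):
    """Alternating maximization of the toy functional
    L(theta, eta) = g_theta*theta + g_eta*eta
                    - (d2*theta**2 + 2*a*theta*eta + h2*eta**2) / 2
    for scalar theta and eta."""
    eta, path = eta0, []
    for _ in range(steps):
        theta = (g_theta - a * eta) / d2   # argmax over theta for fixed eta
        eta = (g_eta - a * theta) / h2     # argmax over eta for fixed theta
        path.append((theta, eta))
    return path


d2, h2, a = 2.0, 3.0, 1.5                  # hypothetical block values, d2*h2 > a**2
rho = a * a / (d2 * h2)                    # squared norm of D^{-1} A H^{-1}
g_theta, g_eta = 1.0, -0.5
det = d2 * h2 - a * a
theta_star = (h2 * g_theta - a * g_eta) / det   # global maximizer of the toy functional
eta_star = (d2 * g_eta - a * g_theta) / det

path = alternate_quadratic(d2, h2, a, g_theta, g_eta, eta0=5.0, steps=8)
errors = [abs(eta - eta_star) for _, eta in path]
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
```

In this idealized setting every ratio equals $\rho$ up to rounding, i.e. the convergence is exactly linear; in the general case the terms $\mathfrak{z}(x)$ and $\tau$ add the non-vanishing part of the bound.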


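Lemma 4.6. Assume the same as in Theorem 2.2. Then we get

$$\Omega(x) \subseteq \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^{(1)})\right\},$$

where

$$r_k^{(1)} \le 2\sqrt{2}(1 - \sqrt{\rho})^{-1}\left\{(\mathfrak{z}(x) + \diamondsuit_Q(R_0, x)) + (1 + \sqrt{\rho})\rho^k R_0(x)\right\}. \tag{4.3}$$

Further assume that $\delta(r)/r \vee 12\nu_1\omega \le \epsilon$ and that (2.6) and (2.7) are met with $C(\rho)$
defined in (2.8). Then

$$\Omega(x) \subseteq \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^*)\right\},$$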

where 


-t < C (P) (3 to + £3(x) 2 ) + e 1 _ 7 c ( 6 (P j (x)) (^) (3 to + £3(x) 2 ) 2 

7 2 C (p) 4 


+P '' c (p)R 0 + £ - 


Rl 


1 - c(e, i? 0 ) \P 1 ~ 1 
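Proof. We prove this claim via induction. On $\Omega(x)$ we have

$$\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(R_0), \qquad \text{set } r_k^{(0)} := R_0.$$

Now with Lemma 4.5 we find that

$$\Omega(x) \subseteq \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^{(l)})\right\} \quad \text{implies} \quad \Omega(x) \subseteq \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^{(l+1)})\right\},$$

where

$$r_k^{(l)} \le 2\sqrt{2}(1 - \sqrt{\rho})^{-1}\left(\mathfrak{z}(x) + (1 + \sqrt{\rho})\rho^k R_0(x)\right) + 2\sqrt{2}(1 + \sqrt{\rho}) \sum_{r=0}^{k-1} \rho^r \diamondsuit_Q(r_{k-r}^{(l-1)}, x). \tag{4.4}$$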






Setting $l = 1$ this gives

$$r_k^{(1)} \le 2\sqrt{2}(1 - \sqrt{\rho})^{-1}\left\{(\mathfrak{z}(x) + \diamondsuit_Q(R_0, x)) + (1 + \sqrt{\rho})\rho^k R_0(x)\right\},$$

which gives (4.3). For the second claim we show that


42(x) C[W v kM+1) <E T 0 


fce N 


I limsup 

V 1 ^ 0 o 



— n {^fc.fei+i) e ^°( r fc)} ■ 

fceN 


So we have to show that $\limsup_{l \to \infty} r_k^{(l)} \le r_k^*$, with $r_k^{(l)}$ from (4.4). For this we use $\delta(r)/r \vee
12\nu_1\omega \le \epsilon$ to estimate further


„(0 


< 2V2(1 - yfp) 1 (j(x) + (1 + ^/p)p k R 0 (x)) 


/c-1 


r =0 


+2-\/2(l + yfp)e Y p r ( (r[!_^) + 3 (x) 2 


< 2V2(1 - y/p) 1 (a(x) + e 3 (x) 2 + (1 + v^)p fc R-o(x)) 


k -1 


+2\/2(l + v^£/( r L 1} ) 


r =0 


fc-1 


< 


C (P ) ^ (3(x) + ea(x) 2 ) +p fc R 0 + e^p r (r['_ r 1) ) 2 }> , 


r =0 


where C(p) > 0 is defined in (2.8). We set 


k -1 / k — n—l / fc — n —...— r s - i —1 

4r=E^‘ E W • E /'(4-E-J 2 


n=0 \ r2=0 


r „=0 


2 \ 2 


Claim 


4‘i < I ( t 4_) eU2 ‘ (jW + ^w 2 ) 2 ' 


1 ~P 


+P 


.P _1 - 1 

+ 7 E - ! ‘(cW [ fi”. 


E S —1 oi 
£=0 z 




(4.5) 
We prove this claim via induction. Clearly


fc-i 


fc-i 


4‘,l = E ( r tr!) 2 < 7C(p) 2 E {«*) + £ 3« 2 ) 2 + d ( ‘- r ' ) R§} 


ri=0 ri=0 

k —1 /k—ri—r 2 —1 


< 


+ 7C(„)V £/■ ( £ ^(r<‘: 2 U) 2 

7 - 1=0 \ T 2=0 / 

7C(p) 2 |( 3 W + £3(x) 2 ) 2 + —T3Y R o} 
+7C(p) 2 e 2 4~ 1) . 


Further 


k— 1 / /c—ri —1 

4!l = E W E 

ri=0 \ r 2=0 


r s =0 


E P T ' (4-1, k-r t 

7 - 1=0 


i(Q 


2v 2 


Plugging in (4.5) we get for s >2 


k -1 


2^/ \2 S_1 


4:U£^ ri 7^*C(p) 

ri=0 V 


- P 


E S — 2 9 t 
t=0 Z 


( 3 (x) + e 3 (x) 2 ) 


2 s " 


+P 


P-' ~ 1 


E?=o 2 ‘ 


P.n 


+ 7 S.-J2. (c(p)£) 2- I ^) i 


Shifting the index this gives 


41 <tE,” ^s;: 2 - cW 2-|(_1_) e “‘' 2 “ ( 3W+£3W 2f 


P 


+p K 


P - 1 -! 


£t=i 2 ‘ 


Rn 


+7 Ec:2- (c(rte) 2- (j4 (<-i) i) 2V 
Direct calculation then leads to

4°, < 7^=o 2t C(p) 2 ° 


Es — 1 ot 

t=0 z 0 OS 

( 3 (x) + e 3 (x) 2 ) 


- P 


+P 


P~ l — 1 


1 \ E?=o 2 * 


Rr 


k -1 


+7EE. l2 ‘(c( P ) £ ) 2 ' V ^.(^Sh,) 2 . 


ri=0 


which gives (4.5) with (4.6). Similarly we can prove 


A W =_ 

s,k \l- 


2 s —1 


Rq S - 


Abbreviate 


, s del ( 1 

3s(x) = 


1-/3 


A/^T 25 - 1 ^) 25 , /3 s = f 7 2S_1 (C(/3)e) 2S , 

( 3 (x)+ 63 (x) 2 ) 2 , R.5 = f R^ 


Then 


(0 / 
4 < 


C(p) { (a(x) + £3 (x) 2 ) + p fc R 0 + eA^} 

Z-l s-1 Z-l s-1 Z-l 

- As n Prh( x ) + P fc emi^+11**- 


s=0 r =0 


We estimate further 

l — l S — 1 


s=0 r =0 


Z-l s-1 


r=0 


Prh( x ) - C(p) ( 3 (x) + C 3 (x) 2 ) = Y As n firis (x) 


s=0 r =0 

Z-l 


s=l r =0 


<E7 2S C(p) 2S+1 6 2S - l/ 1 


s=l 


1-/3 


2 s —1 


2\2 S 


( 3 (x) + e 3 (x) 2 ) 


(4.7) 


= el' 


! C(P) 4 (y 3^) 4 x ) + e 3W 2 ) 2 Y1 ( e7C ^Y~ (3( x ) + e 3( x ) 2 )^ 


Assuming (2.6) this gives 
Z—1 s—1 


y Xs n /3r3s(x) < C(/3) ( 3 (x) + £ 3 (x) 2 ) 


s=0 r=0 


+e 


7 2 C(p) 4 
l-c(e, 3 (x)) \l-p 


(3(x) + £3(x) 2 ) 2 . 














With the same argument we find under (2.7) that 


l -1 s-1 

s=0 r=0 


^C(p)Ro + e 


1 


7 2 C (p ) 4 

~ c(e,R 0 ) 



Additionally (2.7) implies 


( i / i 

~[/3 r Ri < I e7C(p)— T -- 

In V P ~ 1 


2 i-i 


R 


2 l 

0 


-> 0. 


Plugging these bounds into (4.7) and letting $l \to \infty$ gives the claim.


□ 


4.3 Result after convergence 

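In the previous section we showed that

$$\Omega(x) \subseteq \bigcap_{r \le 4R_0(x)} \Big\{\sup_{\upsilon \in \Upsilon_\circ(r)} \Big\{\frac{1}{6\nu_1\omega}\|\breve{\mathcal{Y}}(\upsilon)\| - 2r^2\Big\} \le \mathfrak{z}_Q(x, 2p^* + 2p)^2\Big\}$$
$$\qquad \cap \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k} \in \Upsilon_\circ(r_k), \ \tilde\upsilon_{k,k+1} \in \Upsilon_\circ(r_k)\right\} \cap \{\tilde\upsilon, \tilde\upsilon_{\theta^*} \in \Upsilon_\circ(r_0)\},$$

where $r_k$ is defined in (4.4) or (4.3). The claim of Theorem 2.2 follows with the following
lemma: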

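Lemma 4.7. Assume $(\mathcal{E}\mathcal{D}_1)$, $(\mathcal{L}_0)$ and $(\mathcal{I})$ with a central point $\upsilon^\circ = \upsilon^*$ and $\mathcal{D}^2 =
-\nabla^2 \mathbb{E}\mathcal{L}(\upsilon^*)$. Then it holds on $\Omega(x) \subset \Omega$ that for all $k \in \mathbb{N}$

$$\|\breve{D}(\tilde\theta_k - \theta^*) - \breve\xi\| \le \diamondsuit_Q(r_k, x), \tag{4.8}$$

$$\left|2\breve{L}(\tilde\theta_k, \theta^*) - \|\breve\xi\|^2\right| \le 8\left(\|\breve{D}^{-1}\breve\nabla\| + \diamondsuit_Q(r_k, x)\right)\diamondsuit_Q(2(1 + \rho) r_k, x) + \diamondsuit_Q(r_k, x)^2, \tag{4.9}$$

where the spread $\diamondsuit_Q(r, x)$ is defined in (2.5) and where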


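Proof. The proof is nearly the same as that of Theorem 2.2 of Andresen and Spokoiny
(2013), which is inspired by the proof of Theorem 1 of Murphy and Van der Vaart (1999).
So we only sketch it and refer the reader to Andresen and Spokoiny (2013) for the skipped
arguments. We define

$$l : \mathbb{R}^p \times \Upsilon \to \mathbb{R}, \qquad (\theta_1, \theta_2, \eta) \mapsto \mathcal{L}\left(\theta_1, \eta + H^{-2}A^\top(\theta_2 - \theta_1)\right).$$

Note that

$$\nabla_{\theta_1} l(\theta_1, \theta_2, \eta) = \breve\nabla_\theta \mathcal{L}\left(\theta_1, \eta + H^{-2}A^\top(\theta_2 - \theta_1)\right), \qquad \tilde\theta_k = \operatorname*{argmax}_{\theta} l(\theta, \tilde\theta_k, \tilde\eta_k),$$

such that $\breve\nabla_\theta \mathcal{L}(\tilde\theta_k, \tilde\eta_k) = 0$. This gives

$$\|\breve{D}(\tilde\theta_k - \theta^*) - \breve\xi\| = \|\breve{D}^{-1}\breve\nabla \mathcal{L}(\tilde\theta_k, \tilde\eta_k) - \breve{D}^{-1}\breve\nabla \mathcal{L}(\upsilon^*) + \breve{D}(\tilde\theta_k - \theta^*)\|.$$

Now the right hand side can be bounded just as in the proof of Theorem 2.2 of Andresen
and Spokoiny (2013). This gives (4.8).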

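For (4.9) we can represent:

$$\breve{L}(\tilde\theta_k) - \breve{L}(\theta^*) = l(\tilde\theta_k, \tilde\theta_k, \tilde\eta_{k+1}) - l(\theta^*, \theta^*, \tilde\eta_{\theta^*}),$$

where

$$\tilde\eta_{\theta^*} := \Pi_\eta \operatorname*{argmax}_{\upsilon \in \Upsilon, \ \Pi_\theta \upsilon = \theta^*} \mathcal{L}(\upsilon).$$

Due to the definition of $\tilde\theta_k$ and $\tilde\eta_{k+1}$

$$l(\tilde\theta_k, \theta^*, \tilde\eta_{\theta^*}) - l(\theta^*, \theta^*, \tilde\eta_{\theta^*}) \le \breve{L}(\tilde\theta_k) - \breve{L}(\theta^*) \le l(\tilde\theta_k, \tilde\theta_k, \tilde\eta_{k+1}) - l(\theta^*, \tilde\theta_k, \tilde\eta_{k+1}).$$

Again the remaining steps are exactly the same as in the proof of Theorem 2.2 of Andresen
and Spokoiny (2013).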

□ 


5 Proof of Corollary 2.3 

Proof. Note that with the argument of Section 4.1 $\mathbb{P}(\Omega'(x)) \ge 1 - 8e^{-x} - \beta(A)$, where
with $\Omega(x)$ from (4.1)

$$\Omega'(x) := \Omega(x) \cap \{\tilde\upsilon \in \Upsilon_\circ(r_0)\}.$$

On $\Omega'(x)$ it holds due to Theorem 2.2 and due to Theorem 2.1 of Andresen and Spokoiny
(2013)

$$\|\breve{D}(\tilde\theta_k - \theta^*) - \breve\xi\| \le \diamondsuit_Q(r_k, x), \qquad \|\breve{D}(\tilde\theta - \theta^*) - \breve\xi\| \le \diamondsuit(r_0, x).$$

Now the claim follows with the triangle inequality.


□ 
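6 Proof of Theorem 2.4

We prove this theorem in a similar manner to the convergence result in Lemma 4.5.
Redefine the set $\Omega(x)$

$$\Omega(x) := \bigcap_{k=0}^{K} (C_{k,k} \cap C_{k,k+1}) \cap C(\nabla) \cap \{\mathcal{L}(\tilde\upsilon_0, \upsilon^*) \ge -K_0(x)\},$$

where

$$C_{k,k(+1)} := \left\{\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \upsilon^*)\| \le R_0(x), \ \|D(\tilde\theta_k - \theta^*)\| \le R_0(x), \ \|H(\tilde\eta_{k(+1)} - \eta^*)\| \le R_0(x)\right\},$$

$$C(\nabla) := \Big\{\sup_{\upsilon \in \Upsilon_\circ(R_0(x))} \|\mathcal{Y}(\nabla^2)(\upsilon)\| \le 9\nu_2\omega_2 \mathfrak{z}_1(x, 6p^*) R_0(x)\Big\} \cap \left\{\|\mathcal{D}^{-1}\nabla^2\zeta(\upsilon^*)\| \le \mathfrak{z}(x, \nabla^2\zeta(\upsilon^*))\right\},$$

where

$$\mathcal{Y}(\nabla^2)(\upsilon) := \mathcal{D}^{-1}\left(\nabla^2\zeta(\upsilon) - \nabla^2\zeta(\upsilon^*)\right) \in \mathbb{R}^{p^{*2}}.$$

We see that on $\Omega(x)$

$$\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(R_0) := \{\|\mathcal{D}(\upsilon - \tilde\upsilon)\| \le R_0 + r_0\} \cap \Upsilon_\circ(R_0).$$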

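Lemma 6.1. Under the conditions of Theorem 2.4

$$\mathbb{P}(\Omega(x)) \ge 1 - 3e^{-x} - \beta(A).$$

Proof. The proof is very similar to the one presented in Section 4.1, so we only give a
sketch. By assumption

$$\mathbb{P}\left(\|\mathcal{D}^{-1}\nabla^2\zeta(\upsilon^*)\| \le \mathfrak{z}(x, \nabla^2\zeta(\upsilon^*))\right) \ge 1 - e^{-x},$$

and due to $(\mathcal{E}\mathcal{D}_2)$ with Theorem 9.2

$$\mathbb{P}\Big(\sup_{\upsilon \in \Upsilon_\circ(R_0(x))} \|\mathcal{Y}(\nabla^2)(\upsilon)\| \le 9\nu_2\omega_2 \mathfrak{z}_1(x, 6p^*) R_0(x)\Big) \ge 1 - e^{-x}.$$

□

Lemma 6.2. Assume for some sequence $(r_k^{(l)})$ that

$$\bigcap_{k \in \mathbb{N}} \left\{\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \tilde\upsilon)\| \le r_k^{(l)}\right\} \supseteq \Omega(x).$$

Then we get on $\Omega(x)$

$$\|\mathcal{D}(\tilde\upsilon_{k,k(+1)} - \tilde\upsilon)\| \le 2\sqrt{2}(1 + \sqrt{\rho}) \sum_{r=0}^{k-1} \rho^r \|\tau(r_{k-r}^{(l)})\| + 2\sqrt{2}\rho^k (R_0 + r_0) =: r_k^{(l+1)}, \tag{6.1}$$

where

$$\|\tau(r)\| \le \left[\delta(R_0) + 9\nu_2\omega_2 \|\mathcal{D}^{-1}\| \mathfrak{z}_1(x, 6p^*) R_0 + \|\mathcal{D}^{-1}\| \mathfrak{z}(x, \nabla^2\zeta(\upsilon^*))\right] r.$$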

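Proof. 1. We first show that on $\Omega(x)$

$$D(\tilde\theta_k - \tilde\theta) = -D^{-1}A(\tilde\eta_k - \tilde\eta) + \tau(r_k^{(l)}),$$
$$H(\tilde\eta_k - \tilde\eta) = -H^{-1}A^\top(\tilde\theta_{k-1} - \tilde\theta) + \tau(r_k^{(l)}).$$

The proof is very similar to that of Lemma 4.5. Define

$$\alpha(\upsilon, \tilde\upsilon) := \mathcal{L}(\upsilon, \tilde\upsilon) + \|\mathcal{D}(\upsilon - \tilde\upsilon)\|^2/2.$$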

Note that 

L(v,v) = VL(v) — \\T>{v — v )\\ 2 / 2 + a(v, v*) 

= ~\\D (6 — ^) 11 2 / 2 + (0 — 0*) T A(rj — rj) 

-\\H(r] - ri)\\ 2 /2 + a{v,v). 

Setting V eT( 6 k ,rj k ) = 0 we find 

D{G k -9) = D~ 1 A(rj k -rj) + D^Vgaiv^, v). 

We want to show 

_sup D~ 1 V e a((0,^ fc ),e) < ||r(r£°)||, 

(0.%)ero(r«)nr o (Ro) 

where 

D-'Voafav) d = D-^VgLiv) - D 2 (6 -0) - A{lj k - rj)}. 

To see this note that by assumption we have $\Omega(x) \subseteq \{\tilde\upsilon \in \Upsilon_\circ(r_0)\} \subseteq \{\tilde\upsilon \in \Upsilon_\circ(R_0)\}$. By
condition $(\mathcal{L}_0)$, Lemma 4.2 and Taylor expansion we have


_sup \\]EU(0, r] k )\\ 

(e,v k )er o (r^)nr o (R 0 ) 

< sup \\D~ l n e (y]EL{v)- VlEL{v)-T)(v-v*)\\\ 
uen(4 0 )nr o (R 0 ) 

< sup ||D" 1 i7 0 T||||2)- 1 V 2 iL£(«)T- 1 -/p*||4 0 
'VdzlTo (Ro) 

< 

For the remainder note that with ( = L — 1EL on l?(x) using Lemma 4.2 we can bound 


sup 

(e,^)e^(r«)nr o (R 0 ) 


< 


sup 


uer 0 (r®)nr 0 (Ro) 


Ue(0,V k ) ~ JEUo(0,ri k ) 

'(VeC^-VeC^)) 


D~ 


< sup ||t- 1 v 2 c(w)t- 1 ||4° 

veTo(r) 


< sup 
■L>eT o (R 0 ) 


{^ll^ 1 ( v2((v) - v2((v "» 


.(0 


+ <j ||D- 1 V 2 C(^)D- 1 " R (0 


< [9u 2 o; 2 ||T- 1 ||3 1 (x,6^)Ro + ||T- 1 ||3(x,V 2 CK))] 


JO 


Using the same argument for r] k gives the claim. 


Now the claim follows as in the proof of Lemma 4.5. □ 

Lemma 6.3. Assume that 5(r)/rV9i'2W2V||2)~ 1 || < e 2 . Further assume that x(x, Rq) < 
1 — p where 

x(x,i?o) -^-X==^-(fi(Ro) + 9w 2 ^2||I) _1 ||3i(x,6p*)i?o 

V 1 - P \ 

+ \\V-%(x,X7 2 L(v*))y 

Then

$$\Omega(x) \subseteq \bigcap_{k \in \mathbb{N}} \left\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(r_k^*)\right\},$$

where the radii $(r_k^*)$ satisfy the bound (2.12).

Proof. Define for all $k \in \mathbb{N}_0$ the sequence $r_k^{(0)} = R_0$. We estimate

||T(r[°)|| < *- (5(R 0 ) + 6 i^ia; 2 ||D^ i || 3 i(x, 6 p*)Ro + || R_ 1 ||3( X , ®(V 2 )) r[°, 

v 1 ~ P 

such that by definition 

k—1 k—1 

2V2{1 +y/p)J2p r \\ T ( r k-r)\\ ^ ^( X > R 0 

r=0 r=0 


Plugging in the recursive formula for r[ ?) from (6.1) and denoting Rq == Rq + tq we find 


def 


k—1 


r k < ^( x , R o) P rj: k-r + 2v / 2p fc Ro 

r=0 

k —1 / fc—r—1 




< x(x, Ro) / ( x(x, Ro) ^ p Sr k_r-s + 2 P k ~ r Ro ) + 2 V 2 R 0 / 0 " 

r=0 \ s=0 / 

k—1 k—r—1 

< x( x , R o) 2 22 P r P Sj: k-r-s + 2 \/2//' R 0 (x(x, R 0 )fc + 1) 


r=0 s =0 

fc—1 k—r—1 


k—r—s—1 


<x( X ,r<,) 2 ^/ e / u*.R«) e Aihv,+2/-’-*Ro 

r=0 s=0 \ t=0 / 

+2\/2p fc R 0 (x(x, Ro)fc + 1) 

fc—1 k—1 —1 

x ( x > R o) 3 22 P r ^2 P Sz k-r-s + 2\/2p fcR 0 (x(x, R 0 ) 2 /c 2 + x(x, R 0 )fc + 1 ) 


< 


r —0 s=0 

By induction this gives for l G N 


*-£i=ir.-i 


k—1 k—r i — l 

rl ,) <Mx,Ro) i E'>" E f"- E ^ 


ri=0 r2=0 

Z-l 

+2v/2p fc R 0 x(x, R 0 )^ s 


n=o 


< 


< 


s=0 

x(x,R 0 )V 
. 1 -P 


Z-l 


+ 2V2p k 22(^ R o)k) s R o 


s=0 


x(x,Rp) 

1-p 


+ SVV x-^x.Rolfc ) R °’ x(x ’ Ro)fc - R 


^( x , R o)' ((l^) + 2 ^P fc x(x,Ro)fc-l ) Ro ’ otherwise - 


By Lemma 6.2 


p| n {vfc,fc ( +i) e To (4°)} . 
fceN 0 zeN 
Set, if $\varkappa(x, R_0)/(1 - \rho) < 1$,


m d = J 00 ’ 

‘ k log(p)+log(2\/2)-log(x(x,Ro)fc-l) 
-log(l-p)-log(fe) ’ 


mi • i 1 * def (I l(k) I) 

Inen with r'jl = we get 


x(x, Ro )k < 1 , 
otherwise. 


^( x ) C p| {w fcl fc (+1) € F 0 (r£)} , 
fee N 0 

as claimed. 


4 < 


11 2 ^ 


:Rn 


1—x(x,Ro)A: 

x(x,Rq) \ log(fc) 1 

1-p 


Ro 


x(x, Ro)fc < 1 , 


otherwise, 


□ 
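The iteration used in this proof can be traced numerically. The sketch below only illustrates the recursion $r_k^{(l+1)} = \varkappa \sum_{r=0}^{k-1} \rho^r r_{k-r}^{(l)} + 2\sqrt{2}\rho^k (R_0 + r_0)$ started at $r_k^{(0)} = R_0$; the function name, the argument name `r0_sum` for $R_0 + r_0$, and all numerical values are hypothetical, and $\varkappa$ is treated as a given constant with $\varkappa < 1 - \rho$.

```python
import math


def iterate_radii(kappa, rho, r0_sum, r_init, k_max, sweeps):
    """Iterate r_k^(l+1) = kappa * sum_{r<k} rho^r r_{k-r}^(l) + 2 sqrt(2) rho^k r0_sum."""
    radii = [r_init] * (k_max + 1)            # r_k^(0) = R_0 for k = 0, ..., k_max
    history = [radii]
    for _ in range(sweeps):
        new = []
        for k in range(k_max + 1):
            conv = sum(rho ** r * radii[k - r] for r in range(k))
            new.append(kappa * conv + 2.0 * math.sqrt(2.0) * rho ** k * r0_sum)
        radii = new
        history.append(radii)
    return history


# Hypothetical constants satisfying kappa < 1 - rho.
history = iterate_radii(kappa=0.2, rho=0.5, r0_sum=1.2, r_init=1.0, k_max=30, sweeps=6)
radius_at_k30 = [radii[30] for radii in history]
```

For a late step such as $k = 30$ the radius shrinks roughly by the factor $\varkappa/(1 - \rho)$ with each sweep, until the term of order $\rho^k$ takes over; this is the mechanism behind the two regimes in (2.12).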


7 Deviation bounds for quadratic forms 

This section is the same as Section A of Andresen and Spokoiny (2013). The following
general result from Spokoiny (2012) helps to control the deviation for quadratic forms
of type $\|\mathbb{B}\xi\|^2$ for a given positive matrix $\mathbb{B}$ and a random vector $\xi$. It will be used
several times in our proofs. Suppose that

$$\log \mathbb{E}\exp(\gamma^\top \xi) \le \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le g.$$

For a symmetric matrix $\mathbb{B}$, define

$$p_{\mathbb{B}} := \operatorname{tr}(\mathbb{B}^2), \qquad v_{\mathbb{B}}^2 := 2\operatorname{tr}(\mathbb{B}^4), \qquad \lambda^* := \|\mathbb{B}^2\|_\infty := \lambda_{\max}(\mathbb{B}^2).$$

For ease of presentation, suppose that g 2 > 2pje- The other case only changes the 
constants in the inequalities. Note that ||£|| 2 = r) T IB rj . Define fi c = 2/3 and 

$$g_c := \sqrt{g^2 - \mu_c p_{\mathbb{B}}},$$

$$2(x_c + 2) := (g^2/\mu_c - p_{\mathbb{B}})/\lambda^* + \log\det\left(I_p - \mu_c \mathbb{B}^2/\lambda^*\right).$$

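Proposition 7.1. Let $(ED_0)$ hold with $\nu_0 = 1$ and $g^2 \ge 2p_{\mathbb{B}}$. Then for each $x > 0$

$$\mathbb{P}\left(\|\mathbb{B}\xi\| \ge \mathfrak{z}(x, \mathbb{B})\right) \le 2e^{-x},$$

where $\mathfrak{z}(x, \mathbb{B})$ is defined by

$$\mathfrak{z}^2(\mathbb{B}, x) := \begin{cases} p_{\mathbb{B}} + 2v_{\mathbb{B}}(x + 1)^{1/2}, & x + 1 \le v_{\mathbb{B}}/(18\lambda^*), \\ p_{\mathbb{B}} + 6\lambda^*(x + 1), & v_{\mathbb{B}}/(18\lambda^*) < x + 1 \le x_c + 2, \\ \left|y_c + 2\lambda^*(x - x_c + 1)/g_c\right|^2, & x > x_c + 1, \end{cases}$$

with $y_c^2 \le p_{\mathbb{B}} + 6\lambda^*(x_c + 2)$.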
8 A uniform bound for the norm of a random process

We want to derive for a random process $\mathcal{Y}(\upsilon) \in \mathbb{R}^p$ a bound of the kind

P ( sup sup j-||y(w)|| - 2r 2 ) > C 3 Q (x,p*) ) < e _x . 
yr<r* D£To(r) J J 

This is a slightly stronger result than the one derived in Section D of Andresen and
Spokoiny (2013), but the ideas employed here are very similar.

We want to apply Corollary 2.5 of the supplement of Spokoiny (2012), which we cite
here as a theorem. Note that we slightly generalized the formulation of the theorem to
make it applicable in our setting. The proof remains the same.


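Theorem 8.1. Let $(U(r))_{0 \le r \le r^*} \subset \mathbb{R}^p$ be a sequence of balls around $\upsilon^*$ induced by the
metric $d(\cdot, \cdot)$. Let a random real valued process $\mathcal{U}(r, \upsilon)$ fulfill for any $0 \le r \le r^*$ that
$\mathcal{U}(r, \upsilon^*) = 0$ and

$(\mathcal{E}d)$ For any $\upsilon, \upsilon^\circ \in U(r)$

$$\log \mathbb{E}\exp\left\{\lambda \frac{\mathcal{U}(r, \upsilon) - \mathcal{U}(r, \upsilon^\circ)}{d(\upsilon, \upsilon^\circ)}\right\} \le \frac{\nu_0^2\lambda^2}{2}, \qquad |\lambda| \le g. \tag{8.1}$$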
Finally assume that sup vg[ /( r )(U(r, v)) increases in r. Then with probability greater 
1 -e" x 

sup |^-U(r,n>)-d(u,t>*) 2 ) < z Q (x,p*) 2 , 
v€U( r) l J 

def 

where 3 g(x,p*) = Q(U(r*)) denotes the entropy of the set U{ r*) C FL P and where with 
go = v og and for some Q > 0 

/ def / ( 1 + \/x + Q) 2 if 1 + \/x + Q < go, 

3q(x,Q) 2 = \ (8.2) 

[ 1 + { 2 g 0 1 (x + Q) + go } 2 otherwise. 
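
The two regimes in (8.2) are easy to evaluate numerically. The following is a minimal sketch only; the function name `z_q_squared` and its argument names are hypothetical, with `q` standing for the entropy $\mathbb{Q}$ and `g0` for $\mathrm{g}_{0} = \nu_{0}\mathrm{g}$.

```python
import math


def z_q_squared(x, q, g0):
    """Evaluate z_Q(x, Q)^2 from (8.2) for deviation level x, entropy q and g0 = nu_0 * g."""
    root = 1.0 + math.sqrt(x + q)
    if root <= g0:
        # Moderate deviations: sub-Gaussian regime
        return root ** 2
    # Large deviations: the bound grows quadratically in (x + q)
    return 1.0 + (2.0 * (x + q) / g0 + g0) ** 2
```

For instance, `z_q_squared(1.0, 3.0, 10.0)` falls into the first regime, while `z_q_squared(1.0, 3.0, 2.0)` falls into the second.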

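To use this result let $\mathcal{Y}(\upsilon)$ be a smooth centered random vector process with values 
in $\mathbb{R}^{p}$ and let $\mathcal{D} : \mathbb{R}^{p^{*}} \to \mathbb{R}^{p^{*}}$ be some linear operator. We aim at bounding the 
maximum of the norm $\|\mathcal{Y}(\upsilon)\|$ over a vicinity $\Upsilon_{\circ}(r) \stackrel{\mathrm{def}}{=} \{\|\mathcal{D}(\upsilon - \upsilon^{*})\| \le r\}$ of $\upsilon^{*}$. 
Suppose that $\mathcal{Y}(\upsilon)$ satisfies for each $0 < r < r^{*}$ and for all pairs $\upsilon, \upsilon^{\circ} \in \Upsilon_{\circ}(r) = \{\upsilon \in \Upsilon : \|\mathcal{D}(\upsilon - \upsilon^{*})\| \le r\} \subset \mathbb{R}^{p^{*}}$ 

$$
\sup_{\|u\| \le 1} \log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{u^{\top}\bigl(\mathcal{Y}(\upsilon) - \mathcal{Y}(\upsilon^{\circ})\bigr)}{\omega\, \|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\} \le \frac{\nu_{0}^{2}\lambda^{2}}{2}. \tag{8.3}
$$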
Remark 8.1. In the setting of Theorem 2.2 we have 

$$
\mathcal{Y}(\upsilon) = \mathcal{D}^{-1}\bigl(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon^{*})\bigr),
$$

and condition (8.3) becomes $(\mathcal{E}\mathcal{D}_{1})$ from Section 2.1. 

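Theorem 8.2. Let a random $p$-vector process $\mathcal{Y}(\upsilon)$ fulfill $\mathcal{Y}(\upsilon^{*}) = 0$ and let condition 
(8.3) be satisfied. Then for each $0 \le r \le r^{*}$, on a set of probability greater than $1 - e^{-x}$ 

$$
\sup_{r \le r^{*}} \sup_{\upsilon \in \Upsilon_{\circ}(r)} \Bigl\{ \frac{1}{6\omega\nu_{0}}\|\mathcal{Y}(\upsilon)\| - 2r^{2} \Bigr\} \le \mathfrak{z}_{Q}(x, 2p^{*} + 2p)^{2},
$$

with $\mathrm{g}_{0} = \nu_{0}\mathrm{g}$. 

Remark 8.2. Note that the entropy of the original set $\Upsilon_{\circ}(r) \subset \mathbb{R}^{p^{*}}$ is equal to $2p^{*}$. 
So in order to control the norm $\|\mathcal{Y}(\upsilon)\|$ one only pays with the additional summand $2p$. 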


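Proof. In what follows, we use the representation 

$$
\|\mathcal{Y}(\upsilon)\| = \omega \sup_{\|u\| \le \|\mathcal{D}(\upsilon - \upsilon^{*})\|} \frac{1}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon).
$$

This implies 

$$
\sup_{\upsilon \in \Upsilon_{\circ}(r)} \|\mathcal{Y}(\upsilon)\| = \omega \sup_{\upsilon \in \Upsilon_{\circ}(r)} \ \sup_{\|u\| \le \|\mathcal{D}(\upsilon - \upsilon^{*})\|} \frac{1}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon).
$$

Due to Lemma 8.3 the process $\mathcal{U}(r,\upsilon,u) \stackrel{\mathrm{def}}{=} \frac{1}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon)$ satisfies condition $(\mathcal{E}d)$ 
(see (8.1)) as process on $\mathcal{U}(r^{*})$ where 

$$
\mathcal{U}(r) \stackrel{\mathrm{def}}{=} \Upsilon_{\circ}(r) \times B_{r}(0). \tag{8.4}
$$

Further $\sup_{(\upsilon,u) \in \mathcal{U}(r)} \mathcal{U}(r,\upsilon,u)$ is increasing in $r$. This allows us to apply Theorem 8.1 to 
obtain the desired result. Set $d((\upsilon,u),(\upsilon^{\circ},u^{\circ}))^{2} = \|\mathcal{D}(\upsilon - \upsilon^{\circ})\|^{2} + \|u - u^{\circ}\|^{2}$. We get 
on a set of probability greater than $1 - e^{-x}$ 

$$
\sup_{(\upsilon,u) \in \mathcal{U}(r^{*})} \Bigl\{ \frac{1}{6\omega\nu_{0}\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon) - \|\mathcal{D}(\upsilon - \upsilon^{*})\|^{2} - \|u\|^{2} \Bigr\} \le \mathfrak{z}_{Q}\bigl(x, \mathbb{Q}(\mathcal{U}(r^{*}))\bigr)^{2}.
$$

The constant $\mathbb{Q}(\mathcal{U}(r^{*})) > 0$ quantifies the complexity of the set $\mathcal{U}(r^{*}) \subset \mathbb{R}^{p^{*}} \times \mathbb{R}^{p}$. 
We point out that for compact $M \subset \mathbb{R}^{p^{*}}$ we have $\mathbb{Q}(M) = 2p^{*}$ (see Supplement of 
Spokoiny (2012), Lemma 2.10). This gives $\mathbb{Q}(\mathcal{U}) = 2p^{*} + 2p$. Finally observe that, since $\|\mathcal{D}(\upsilon - \upsilon^{*})\|^{2} + \|u\|^{2} \le 2r^{2}$ on $\mathcal{U}(r)$, 

$$
\begin{aligned}
\sup_{r \le r^{*}} \sup_{\upsilon \in \Upsilon_{\circ}(r)} \Bigl\{ \frac{1}{6\omega\nu_{0}}\|\mathcal{Y}(\upsilon)\| - 2r^{2} \Bigr\}
&\le \sup_{r \le r^{*}} \sup_{(\upsilon,u) \in \mathcal{U}(r)} \Bigl\{ \frac{1}{6\omega\nu_{0}\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon) - \|\mathcal{D}(\upsilon - \upsilon^{*})\|^{2} - \|u\|^{2} \Bigr\} \\
&= \sup_{(\upsilon,u) \in \mathcal{U}(r^{*})} \Bigl\{ \frac{1}{6\omega\nu_{0}\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\, u^{\top}\mathcal{Y}(\upsilon) - \|\mathcal{D}(\upsilon - \upsilon^{*})\|^{2} - \|u\|^{2} \Bigr\}.
\end{aligned}
$$

□ 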
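
The representation of the Euclidean norm as a supremum over a ball, on which the proof above relies, can be checked numerically. The snippet below is only an illustrative sketch with hypothetical helper names; the factor $\omega$ cancels in the representation and is therefore omitted, and `radius` plays the role of $\|\mathcal{D}(\upsilon - \upsilon^{*})\|$.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def scaled_inner(u, y, radius):
    """u^T y / radius, the quantity maximized over the ball ||u|| <= radius."""
    return sum(a * b for a, b in zip(u, y)) / radius


def maximizer(y, radius):
    """u = radius * y / ||y|| attains the supremum (assumes y is nonzero)."""
    n = norm(y)
    return [radius * c / n for c in y]


def random_point_in_ball(dim, radius, rng):
    """Random direction with random length <= radius (not uniform; enough for a check)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = radius * rng.random() / norm(g)
    return [scale * c for c in g]
```

By the Cauchy-Schwarz inequality no point of the ball can exceed $\|y\|$, and the rescaled vector returned by `maximizer` attains it.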


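Lemma 8.3. Suppose that $\mathcal{Y}(\upsilon)$ satisfies for each $\|u\| \le 1$ and $|\lambda| \le \mathrm{g}$ the inequality 
(8.3). Then the process $\mathcal{U}(\upsilon,u) = \frac{1}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\,\mathcal{Y}(\upsilon)^{\top}u$ satisfies $(\mathcal{E}d)$ from (8.1) with 
$|\lambda| \le \mathrm{g}/2$, $d((\upsilon,u),(\upsilon^{\circ},u^{\circ}))^{2} = \|\mathcal{D}(\upsilon - \upsilon^{\circ})\|^{2} + \|u - u^{\circ}\|^{2}$, $\nu = 2\nu_{0}$ and $\mathcal{U} \subset \mathbb{R}^{p^{*}+p}$ 
defined in (8.4), i.e. for any $(\upsilon,u_{1}), (\upsilon^{\circ},u_{2}) \in \mathcal{U}$ 

$$
\log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon,u_{1}) - \mathcal{U}(\upsilon^{\circ},u_{2})}{d((\upsilon,u_{1}),(\upsilon^{\circ},u_{2}))} \Bigr\} \le \frac{4\nu_{0}^{2}\lambda^{2}}{2}, \qquad |\lambda| \le \mathrm{g}/2.
$$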


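Proof. Let $(\upsilon,u_{1}), (\upsilon^{\circ},u_{2}) \in \mathcal{U}$ and w.l.o.g. $\|u_{1}\| \le \|\mathcal{D}(\upsilon - \upsilon^{*})\| \le \|\mathcal{D}(\upsilon^{\circ} - \upsilon^{*})\|$. By 
the Hölder inequality and (8.3), we find 

$$
\begin{aligned}
&\log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon,u_{1}) - \mathcal{U}(\upsilon^{\circ},u_{2})}{d((\upsilon,u_{1}),(\upsilon^{\circ},u_{2}))} \Bigr\} \\
&\quad = \log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon,u_{1}) - \mathcal{U}(\upsilon^{\circ},u_{1}) + \mathcal{U}(\upsilon^{\circ},u_{1}) - \mathcal{U}(\upsilon^{\circ},u_{2})}{d((\upsilon,u_{1}),(\upsilon^{\circ},u_{2}))} \Bigr\} \\
&\quad \le \frac{1}{2} \log \mathbb{E} \exp\Bigl\{ 2\lambda\, \frac{u_{1}^{\top}\bigl( \frac{1}{\|\mathcal{D}(\upsilon - \upsilon^{*})\|}\mathcal{Y}(\upsilon) - \frac{1}{\|\mathcal{D}(\upsilon^{\circ} - \upsilon^{*})\|}\mathcal{Y}(\upsilon^{\circ}) \bigr)}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\}
 + \frac{1}{2} \log \mathbb{E} \exp\Bigl\{ 2\lambda\, \frac{(u_{1} - u_{2})^{\top}\mathcal{Y}(\upsilon^{\circ})}{\omega\,\|u_{1} - u_{2}\|\,\|\mathcal{D}(\upsilon^{\circ} - \upsilon^{*})\|} \Bigr\} \\
&\quad \le \sup_{\|u\| \le 1} \frac{1}{2} \log \mathbb{E} \exp\Bigl\{ 2\lambda\, \frac{u^{\top}\bigl(\mathcal{Y}(\upsilon) - \mathcal{Y}(\upsilon^{\circ})\bigr)}{\omega\,\|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\}
 + \sup_{\|u\| \le 1} \frac{1}{2} \log \mathbb{E} \exp\Bigl\{ 2\lambda\, \frac{u^{\top}\bigl(\mathcal{Y}(\upsilon^{\circ}) - \mathcal{Y}(\upsilon^{*})\bigr)}{\omega\,\|\mathcal{D}(\upsilon^{\circ} - \upsilon^{*})\|} \Bigr\} \\
&\quad \le \frac{4\nu_{0}^{2}\lambda^{2}}{2}, \qquad |\lambda| \le \mathrm{g}/2.
\end{aligned}
$$

□ 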
9 A bound for the spectral norm of a random matrix process 

We want to derive for a random process $\mathcal{Y}(\upsilon) \in \mathbb{R}^{p^{*} \times p^{*}}$ a bound of the kind 

$$
\mathbb{P}\Bigl( \sup_{\upsilon \in \Upsilon_{\circ}(r)} \|\mathcal{Y}(\upsilon)\| \ge C\,\omega_{2}\,\mathfrak{z}_{1}(x,p^{*})\, r \Bigr) \le e^{-x}.
$$

We derive such a bound in a very similar manner to Theorem E.1 of Andresen and 
Spokoiny (2013). 

We want to apply Corollary 2.2 of the supplement of Spokoiny (2012). Again we 
slightly generalized the formulation but the proof remains the same. 


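Corollary 9.1. Let $(\mathcal{U}(r))_{0 \le r \le r^{*}} \subset \mathbb{R}^{p}$ be a sequence of balls around $\upsilon^{*}$ induced by 
the metric $d(\cdot,\cdot)$. Let a random real valued process $\mathcal{U}(\upsilon)$ fulfill that $\mathcal{U}(\upsilon^{*}) = 0$ and 

$(\mathcal{E}d)$ For any $\upsilon, \upsilon^{\circ} \in \mathcal{U}(r)$ 

$$
\log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^{\circ})}{d(\upsilon,\upsilon^{\circ})} \Bigr\} \le \frac{\nu_{0}^{2}\lambda^{2}}{2}, \qquad |\lambda| \le \mathrm{g}. \tag{9.1}
$$

Then for each $0 \le r \le r^{*}$, on a set of probability greater than $1 - e^{-x}$ 

$$
\sup_{\upsilon \in \mathcal{U}(r)} \mathcal{U}(\upsilon) \le 3\nu_{0}\,\mathfrak{z}_{1}(x,p^{*})\, d(\upsilon,\upsilon^{*}),
$$

where the argument $p^{*}$ of $\mathfrak{z}_{1}(x,p^{*})$ stands for the entropy $\mathbb{Q}(\mathcal{U}(r^{*}))$ of the set $\mathcal{U}(r^{*}) \subset \mathbb{R}^{p}$ and where with 
$\mathrm{g}_{0} = \nu_{0}\mathrm{g}$ and for some $\mathbb{Q} > 0$ 

$$
\mathfrak{z}_{1}(x,\mathbb{Q}) \stackrel{\mathrm{def}}{=}
\begin{cases}
\sqrt{2(x + \mathbb{Q})} & \text{if } \sqrt{2(x + \mathbb{Q})} \le \mathrm{g}_{0}, \\
\mathrm{g}_{0}^{-1}(x + \mathbb{Q}) + \mathrm{g}_{0}/2 & \text{otherwise.}
\end{cases}
$$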
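
A short numerical sketch of the quantile function $\mathfrak{z}_{1}(x,\mathbb{Q})$ from Corollary 9.1 is given below. The function and argument names are hypothetical; `q` stands for the entropy $\mathbb{Q}$ and `g0` for $\mathrm{g}_{0} = \nu_{0}\mathrm{g}$.

```python
import math


def z_1(x, q, g0):
    """Evaluate z_1(x, Q) for deviation level x, entropy q and g0 = nu_0 * g."""
    root = math.sqrt(2.0 * (x + q))
    if root <= g0:
        # Sub-Gaussian regime: square-root growth in (x + q)
        return root
    # Beyond g0 the growth becomes linear in (x + q)
    return (x + q) / g0 + g0 / 2.0
```

The two branches agree at the switch point $\sqrt{2(x + \mathbb{Q})} = \mathrm{g}_{0}$, where both equal $\mathrm{g}_{0}$, so the function is continuous in $x$.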


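To use this result let $\mathcal{Y}(\upsilon)$ be a smooth centered random process with values in 
$\mathbb{R}^{p^{*} \times p^{*}}$ and let $\mathcal{D} : \mathbb{R}^{p^{*}} \to \mathbb{R}^{p^{*}}$ be some linear operator. We aim at bounding the 
maximum of the spectral norm $\|\mathcal{Y}(\upsilon)\|$ over a vicinity $\Upsilon_{\circ}(r) \stackrel{\mathrm{def}}{=} \{\|\upsilon - \upsilon^{*}\|_{\mathcal{Y}} \le r\}$ of 
$\upsilon^{*}$. Suppose that $\mathcal{Y}(\upsilon)$ satisfies $\mathcal{Y}(\upsilon^{*}) = 0$ and for each $0 < r < r^{*}$ and for all pairs 
$\upsilon, \upsilon^{\circ} \in \Upsilon_{\circ}(r) = \{\upsilon \in \Upsilon : \|\upsilon - \upsilon^{*}\|_{\mathcal{Y}} \le r\} \subset \mathbb{R}^{p^{*}}$ 

$$
\sup_{\|u_{1}\| \le 1} \ \sup_{\|u_{2}\| \le 1} \log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{u_{1}^{\top}\bigl(\mathcal{Y}(\upsilon) - \mathcal{Y}(\upsilon^{\circ})\bigr)u_{2}}{\omega_{2}\,\|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\} \le \frac{\nu_{2}^{2}\lambda^{2}}{2}. \tag{9.2}
$$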
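Remark 9.1. In the setting of Theorem 2.4 we have $\|\upsilon - \upsilon^{\circ}\|_{\mathcal{Y}} = \|\mathcal{D}(\upsilon - \upsilon^{\circ})\|$ and 

$$
\mathcal{Y}(\upsilon) = \mathcal{D}^{-1}\nabla^{2}\zeta(\upsilon) - \mathcal{D}^{-1}\nabla^{2}\zeta(\upsilon^{*}),
$$

and condition (9.2) becomes $(\mathcal{E}\mathcal{D}_{2})$ from Section 2.1. 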

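Theorem 9.2. Let a random process $\mathcal{Y}(\upsilon) \in \mathbb{R}^{p^{*} \times p^{*}}$ fulfill $\mathcal{Y}(\upsilon^{*}) = 0$ and let condition 
(9.2) be satisfied. Then for each $0 \le r \le r^{*}$, on a set of probability greater than $1 - e^{-x}$ 

$$
\sup_{\upsilon \in \Upsilon_{\circ}(r)} \|\mathcal{Y}(\upsilon)\| \le 9\,\omega_{2}\nu_{2}\,\mathfrak{z}_{1}(x, 6p^{*})\, r,
$$

with $\mathrm{g}_{0} = \nu_{0}\mathrm{g}$. 

Remark 9.2. Note that the entropy of the original set $\Upsilon_{\circ}(r) \subset \mathbb{R}^{p^{*}}$ is multiplied by 3. 
So in order to control the spectral norm $\|\mathcal{Y}(\upsilon)\|$ one only pays with this factor. 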

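Proof. In what follows, we use the representation 

$$
\|\mathcal{Y}(\upsilon)\| = \omega_{2} \sup_{\|u_{1}\| \le r} \ \sup_{\|u_{2}\| \le r} \frac{1}{\omega_{2} r^{2}}\, u_{1}^{\top}\mathcal{Y}(\upsilon)u_{2}.
$$

This implies 

$$
\sup_{\upsilon \in \Upsilon_{\circ}(r)} \|\mathcal{Y}(\upsilon)\| = \omega_{2} \sup_{\upsilon \in \Upsilon_{\circ}(r)} \ \sup_{\|u_{1}\| \le r} \ \sup_{\|u_{2}\| \le r} \frac{1}{\omega_{2} r^{2}}\, u_{1}^{\top}\mathcal{Y}(\upsilon)u_{2}.
$$

Due to Lemma 9.3 the process $\mathcal{U}(\upsilon,u_{1},u_{2}) \stackrel{\mathrm{def}}{=} \frac{1}{\omega_{2} r^{2}}\, u_{1}^{\top}\mathcal{Y}(\upsilon)u_{2}$ satisfies condition $(\mathcal{E}d)$ (see (9.1)) 
as process on 

$$
\mathcal{U}(r) = \Upsilon_{\circ}(r) \times B_{r}(0) \times B_{r}(0) \subset \mathbb{R}^{3p^{*}}. \tag{9.3}
$$

This allows us to apply Corollary 9.1 to obtain the desired result. We get on a set of 
probability greater than $1 - e^{-x}$ 

$$
\sup_{\upsilon \in \Upsilon_{\circ}(r)} \|\mathcal{Y}(\upsilon)\| \le \sup_{(\upsilon,u_{1},u_{2}) \in \mathcal{U}(r)} \Bigl\{ \frac{1}{r^{2}}\, u_{1}^{\top}\mathcal{Y}(\upsilon)u_{2} \Bigr\} \le 9\,\omega_{2}\nu_{2}\,\mathfrak{z}_{1}\bigl(x, \mathbb{Q}(\mathcal{U}(r^{*}))\bigr)\, r.
$$

The constant $\mathbb{Q}(\mathcal{U}(r)) > 0$ quantifies the complexity of the set $\mathcal{U}(r) \subset \mathbb{R}^{3p^{*}}$. We point 
out that for compact $M \subset \mathbb{R}^{3p^{*}}$ we have $\mathbb{Q}(M) = 6p^{*}$ (see Supplement of Spokoiny 
(2012), Lemma 2.10). This gives the claim. □ 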


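Lemma 9.3. Suppose that $\mathcal{Y}(\upsilon) \in \mathbb{R}^{p^{*} \times p^{*}}$ satisfies $\mathcal{Y}(\upsilon^{*}) = 0$ and for each $\|u_{1}\| \le 1$, 
$\|u_{2}\| \le 1$ and $|\lambda| \le \mathrm{g}$ the inequality (9.2). Then the process 

$$
\mathcal{U}(\upsilon,u_{1},u_{2}) \stackrel{\mathrm{def}}{=} \frac{1}{\omega_{2} r^{2}}\, u_{1}^{\top}\mathcal{Y}(\upsilon)u_{2}
$$

satisfies $(\mathcal{E}d)$ from (9.1) with $\mathcal{U} \subset \mathbb{R}^{3p^{*}}$ defined in (9.3), with $|\lambda| \le \mathrm{g}/3$ and with 

$$
d((\upsilon,u_{1},u_{2}),(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ}))^{2} = \|\mathcal{D}(\upsilon - \upsilon^{\circ})\|^{2} + \|u_{1} - u_{1}^{\circ}\|^{2} + \|u_{2} - u_{2}^{\circ}\|^{2},
$$

i.e. for any $(\upsilon,u_{1},u_{2}), (\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ}) \in \mathcal{U}$ 

$$
\log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon,u_{1},u_{2}) - \mathcal{U}(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ})}{d((\upsilon,u_{1},u_{2}),(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ}))} \Bigr\} \le \frac{9\nu_{2}^{2}\lambda^{2}}{2}, \qquad |\lambda| \le \mathrm{g}/3.
$$
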
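Proof. Let $(\upsilon,u_{1},u_{2}), (\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ}) \in \mathcal{U}$. Write $d = d((\upsilon,u_{1},u_{2}),(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ}))$ for brevity. By the Hölder inequality and (9.2), we find 

$$
\begin{aligned}
&\log \mathbb{E} \exp\Bigl\{ \lambda\, \frac{\mathcal{U}(\upsilon,u_{1},u_{2}) - \mathcal{U}(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ})}{d} \Bigr\} \\
&\quad = \log \mathbb{E} \exp\Bigl\{ \lambda \Bigl( \frac{\mathcal{U}(\upsilon,u_{1},u_{2}) - \mathcal{U}(\upsilon^{\circ},u_{1},u_{2})}{d} + \frac{\mathcal{U}(\upsilon^{\circ},u_{1},u_{2}) - \mathcal{U}(\upsilon^{\circ},u_{1}^{\circ},u_{2})}{d} + \frac{\mathcal{U}(\upsilon^{\circ},u_{1}^{\circ},u_{2}) - \mathcal{U}(\upsilon^{\circ},u_{1}^{\circ},u_{2}^{\circ})}{d} \Bigr) \Bigr\} \\
&\quad \le \frac{1}{3} \log \mathbb{E} \exp\Bigl\{ 3\lambda\, \frac{u_{1}^{\top}\bigl(\mathcal{Y}(\upsilon) - \mathcal{Y}(\upsilon^{\circ})\bigr)u_{2}}{\omega_{2} r^{2}\,\|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\}
 + \frac{1}{3} \log \mathbb{E} \exp\Bigl\{ 3\lambda\, \frac{(u_{1} - u_{1}^{\circ})^{\top}\mathcal{Y}(\upsilon^{\circ})u_{2}}{\omega_{2} r^{2}\,\|u_{1} - u_{1}^{\circ}\|} \Bigr\} \\
&\qquad + \frac{1}{3} \log \mathbb{E} \exp\Bigl\{ 3\lambda\, \frac{(u_{1}^{\circ})^{\top}\mathcal{Y}(\upsilon^{\circ})(u_{2} - u_{2}^{\circ})}{\omega_{2} r^{2}\,\|u_{2} - u_{2}^{\circ}\|} \Bigr\} \\
&\quad \le \frac{1}{3} \sup_{\|u_{1}\| \le 1} \ \sup_{\|u_{2}\| \le 1} \log \mathbb{E} \exp\Bigl\{ 3\lambda\, \frac{u_{1}^{\top}\bigl(\mathcal{Y}(\upsilon) - \mathcal{Y}(\upsilon^{\circ})\bigr)u_{2}}{\omega_{2}\,\|\mathcal{D}(\upsilon - \upsilon^{\circ})\|} \Bigr\} \\
&\qquad + \frac{2}{3} \sup_{\|u_{1}\| \le 1} \ \sup_{\|u_{2}\| \le 1} \log \mathbb{E} \exp\Bigl\{ 3\lambda\, \frac{u_{1}^{\top}\bigl(\mathcal{Y}(\upsilon^{\circ}) - \mathcal{Y}(\upsilon^{*})\bigr)u_{2}}{\omega_{2}\,\|\mathcal{D}(\upsilon^{\circ} - \upsilon^{*})\|} \Bigr\} \\
&\quad \le \frac{9\nu_{2}^{2}\lambda^{2}}{2}, \qquad |\lambda| \le \mathrm{g}/3.
\end{aligned}
$$

□ 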
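
The constants appearing in Lemma 8.3 and Lemma 9.3 follow from the same bookkeeping. As a sketch, assume that the increment is split into $k$ terms $\xi_{1},\dots,\xi_{k}$, each of which is sub-Gaussian with parameter $\nu$ for arguments of size at most $\mathrm{g}$; the symbols $k$, $\xi_{j}$ and $\nu$ are generic placeholders here and not notation from the text. The generalized Hölder inequality then gives

```latex
\log \mathbb{E} \exp\Bigl\{ \lambda \sum_{j=1}^{k} \xi_{j} \Bigr\}
  \le \sum_{j=1}^{k} \frac{1}{k} \log \mathbb{E} \exp\{ k\lambda\, \xi_{j} \}
  \le \sum_{j=1}^{k} \frac{1}{k} \cdot \frac{\nu^{2} (k\lambda)^{2}}{2}
  = \frac{k^{2}\nu^{2}\lambda^{2}}{2},
  \qquad |k\lambda| \le \mathrm{g}.
```

With $k = 2$ and $\nu = \nu_{0}$ this yields $4\nu_{0}^{2}\lambda^{2}/2$ for $|\lambda| \le \mathrm{g}/2$, and with $k = 3$ and $\nu = \nu_{2}$ it yields $9\nu_{2}^{2}\lambda^{2}/2$ for $|\lambda| \le \mathrm{g}/3$.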